Mistral 7B: Smarter Than ChatGPT & Meta AI - AI Paper Explained

Harry Mapodile
1 Oct 2023 · 11:00

Summary

TLDR: A new open-source AI model called Mistral 7B was recently released, with 7.3 billion parameters. It outperforms Llama 2 13B on benchmarks and even approaches CodeLlama 7B's performance on code, despite being much smaller than many competitors. Mistral uses techniques like grouped-query attention and sliding window attention to achieve high efficiency, including a 2x inference speedup on long sequences. In benchmarks, Mistral matches or exceeds the compared Llama models on metrics like reasoning and math. Developed in just 3 intense months, Mistral sets a new bar for open-source model performance. Its readiness to provide concerning responses to dangerous prompts has caused some backlash, but overall it marks an impactful new milestone in open-source AI.

Takeaways

  • 😲 Mistral AI released Mistral 7B, an open-source 7.3 billion parameter model that outperforms Llama 2 13B on benchmarks while approaching CodeLlama 7B's performance on code
  • 📈 Mistral achieves state-of-the-art results for its size on benchmarks like MMLU, knowledge/reasoning/comprehension tasks, HumanEval, and MBPP
  • 🚀 Mistral uses grouped-query attention and sliding window attention for faster inference (see the sketch after this list)
  • 🔬 Sliding window attention attends to the previous 4,096 hidden states in each layer, allowing a longer effective context
  • ⚡️ Sliding window attention provides a 2x speedup on sequences of length 16k while reducing memory usage
  • 🤯 Mistral achieves up to 5.4x better performance per parameter compared to Llama 2
  • 💰 Mistral AI raised $113M in seed funding, which enabled them to produce Mistral 7B
  • 🧠 Mistral 7B is the result of 3 months of intense work by a newly assembled team of top ML engineers
  • 🔒 The model gives potentially dangerous responses to malicious instructions
  • 👍🏻 The open-sourced Mistral 7B is already transforming the landscape and surprising researchers
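
The grouped-query attention mentioned above can be illustrated with a minimal sketch. This is not Mistral's actual code; the 32 query heads and 8 key/value heads below are the model's published head counts, everything else is purely illustrative. The idea is that several query heads share one key/value head, so the KV cache kept around during inference is a fraction of the usual size:

```python
# Minimal sketch of grouped-query attention (GQA), for illustration only -- not
# Mistral's implementation. Several query heads share one key/value head, so the
# KV cache stores 8 heads instead of 32, cutting memory and speeding up decoding.
import torch
import torch.nn.functional as F

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)."""
    group_size = q.shape[1] // k.shape[1]
    # Repeat each KV head so every group of query heads attends to the same K/V.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Toy usage: 32 query heads, 8 KV heads (Mistral 7B's published head counts).
b, seq, d = 1, 16, 128
q = torch.randn(b, 32, seq, d)
k = torch.randn(b, 8, seq, d)
v = torch.randn(b, 8, seq, d)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 32, 16, 128])
```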

Q & A

  • What is the new open-source model that was recently released?

    -The new open-source model is called Mistral 7B. It is a 7.3 billion parameter model.

  • How does Mistral 7B compare in performance to other large language models?

    -Mistral 7B outperforms models like Llama 2 13B and Llama 1 34B on many benchmarks. It even approaches the performance of CodeLlama 7B on code tasks while remaining good at English.

  • What licensing does Mistral 7B use?

    -Mistral 7B uses the Apache 2.0 license, which allows developers to modify the code and use it as they like.

  • What technique does Mistral 7B use to achieve faster inference speeds?

    -Mistral 7B uses a sliding window attention mechanism in which each layer attends to the previous 4,096 hidden states. This allows faster processing while still modeling long-range dependencies.

  • How long did it take to build the Mistral 7B model?

    -According to the GitHub page, Mistral 7B is the result of 3 months of intense work by the Mistral AI team.

  • What criticisms were levied against the Mistral AI company previously?

    -Mistral AI was criticized by some in the ML community for raising a very large ($113 million) seed round, with questions about how they would use the money.

  • Is the Mistral 7B model safe to use?

    -Early reports indicate the Mistral 7B model lacks safety guardrails: it readily provides information in response to malicious prompts.

  • Did Mistral 7B use any proprietary training data?

    -No, Mistral AI states no proprietary data was used. The instruct variant was fine-tuned on publicly available datasets from Hugging Face.

  • What transformational impacts could Mistral 7B have on open-source AI?

    -As a compact yet highly capable open-source model under a permissive license, Mistral 7B could significantly advance open-source AI capabilities across many domains and applications.

  • What techniques contribute to the performance efficiency of Mistral 7B?

    -Sliding window attention, combined with techniques like a rolling buffer cache to limit memory use, allows Mistral 7B to achieve very high performance relative to its model size (see the sketch below).
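
A minimal sketch of that rolling-buffer idea, assuming a plain per-token decode loop (the class and method names are hypothetical, not Mistral's API): because attention never looks back more than the window, position i can simply overwrite slot i mod W, and the cache stops growing once it reaches the window size.

```python
# Illustrative rolling-buffer KV cache (names are hypothetical, not Mistral's API).
# Older entries are overwritten in place since attention never looks back further
# than `window` tokens.
import torch

class RollingKVCache:
    def __init__(self, window: int, n_kv_heads: int, head_dim: int):
        self.window = window
        self.keys = torch.zeros(window, n_kv_heads, head_dim)
        self.values = torch.zeros(window, n_kv_heads, head_dim)
        self.pos = 0  # absolute position of the next token to be written

    def append(self, k_t: torch.Tensor, v_t: torch.Tensor) -> None:
        slot = self.pos % self.window          # overwrite the oldest entry
        self.keys[slot] = k_t
        self.values[slot] = v_t
        self.pos += 1

    def stored_entries(self) -> int:
        return min(self.pos, self.window)      # memory is capped at the window size

cache = RollingKVCache(window=4096, n_kv_heads=8, head_dim=128)
for _ in range(8192):                          # decode 8,192 tokens
    cache.append(torch.randn(8, 128), torch.randn(8, 128))
print(cache.stored_entries())                  # 4096 -> half the memory of a full cache
```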

Outlines

00:00

😲 Overview of the new open-source Mistral 7B model

Paragraph 1 provides an overview and initial reaction to the release of Mistral 7B, a new open-source 7.3 billion parameter model. It benchmarks very well against models like Llama 2 13B and Llama 1 34B on tests of reasoning, comprehension, and math ability. It even approaches the performance of CodeLlama 7B on coding tasks while remaining good at English. The company behind it, Mistral AI, had raised a lot of money, which caused some skepticism, but now they are delivering this impressive model.

05:02

📈 How Mistral 7B achieves efficiency

Paragraph 2 analyzes the techniques Mistral uses to improve efficiency and performance. It utilizes sliding window attention, in the spirit of the blockwise attention patterns described in the Longformer (long-document Transformer) paper. Each layer attends to a fixed window of previous hidden states, and the stacked layers let information propagate well beyond that window. Combined with tuning of hyperparameters like the window size, Mistral attains much better performance per parameter than comparable models.
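
As a back-of-the-envelope check of that "beyond the window" claim (using Mistral 7B's published configuration of 32 layers and a 4,096-token window; this is an upper bound, since information gets diluted as it propagates):

```python
# Each attention layer lets information flow back at most `window` tokens, so after
# n_layers layers a token can, in principle, be influenced by n_layers * window
# earlier positions -- the ~131K-token theoretical attention span cited by Mistral.
window, n_layers = 4096, 32
print(window * n_layers)  # 131072
```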

10:03

😳 Concerns over the safety and ethics of Mistral 7B

Paragraph 3 raises concerns over the safety and ethics of the Mistral 7B model. Although it is just an LLM, it readily provides dangerous information when prompted. The team also linked to a torrent file when releasing the model. So while the technical abilities of Mistral are groundbreaking, its potential for misuse is equally worrying.

Keywords

💡Mistral 7B

This refers to the new open-source AI model released by the startup Mistral AI. It has 7.3 billion parameters and is billed as the best model of its size to date. The video discusses how its performance matches or exceeds larger open models like Llama 2 13B while using fewer parameters, illustrating the efficiency and capability of this new model.

💡Sliding window attention

This is an attention mechanism used in Mistral 7B and other transformer models which attends to a fixed-size window of previous hidden states rather than to all states. This limits the cache memory needed while still propagating information across layers. The video explains how this technique enables Mistral 7B's efficiency.
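
A minimal sketch of the masking pattern this implies (not Mistral's reference implementation): each query position may attend only to itself and the previous window − 1 positions, which is what bounds the per-layer attention cost and the cache size.

```python
# Hedged sketch of a sliding-window causal mask; Mistral's reference repo is the
# authoritative implementation. True = attention allowed.
import torch

def sliding_window_causal_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    """Position i may attend to positions j with i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(seq_len=6, window=3).int())
# Each row has at most 3 ones: the token itself plus the two previous positions,
# and never anything in the future.
```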

💡Model efficiency

This refers to the performance achieved per parameter of a model. Mistral 7B is highly efficient, matching models with far more parameters on benchmarks. Its innovations allow great capability without massive data and compute costs, letting open source catch up to proprietary models.

💡Benchmark performance

The video compares Mistral 7B against Llama 2 7B, Llama 2 13B, and Llama 1 34B on benchmarks testing skills like reasoning, knowledge, comprehension, and code generation (HumanEval, MBPP). Mistral matches or exceeds them on most tasks while using fewer parameters, demonstrating state-of-the-art performance for its size.

💡Model safety

Some users found Mistral 7B would readily provide harmful instructions when prompted, raising safety concerns. The video notes that open-source models can lag in safety tuning, showing the need to balance performance gains with responsible deployment.

💡Proprietary data

The creators of Mistral 7B emphasize that no proprietary data was used, unlike some closed models. Relying only on publicly available datasets further demonstrates Mistral 7B's impressive generalization.

💡Model architecture

The innovations in Mistral 7B's architecture, like sliding window attention and optimized FlashAttention kernels, build on the transformer base to improve efficiency. Analyzing these architecture decisions provides insight into achieving more with smaller models.
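
As a hedged illustration of the fused-attention idea: Mistral credits FlashAttention and xFormers for its kernels, while the snippet below just uses PyTorch's built-in scaled_dot_product_attention, which can dispatch to a FlashAttention-style kernel on supported GPUs. It is a stand-in for the concept, not the team's actual stack.

```python
# Fused, memory-efficient attention: the full (seq x seq) score matrix is never
# materialized, which is where the speed and memory savings come from.
import torch
import torch.nn.functional as F

q = torch.randn(1, 32, 1024, 128)  # (batch, heads, seq, head_dim), toy sizes
k = torch.randn(1, 32, 1024, 128)
v = torch.randn(1, 32, 1024, 128)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 1024, 128])
```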

💡Training duration

The video notes Mistral 7B is the result of "3 months of intense work", showing the computational resources and time still required for state-of-the-art generative models, and the need to maximize efficiency within those constraints.

💡Model criticism

The video discusses how Mistral AI was previously criticized for raising huge sums of money with unclear plans to advance AI. The release of Mistral 7B demonstrates their progress and capabilities to the ML community.

💡Open-source advances

Mistral 7B represents impressive innovation in open-source models, closing the gap with proprietary efforts. Its efficiency innovations may influence future work to develop capable yet responsible models.

Highlights

Mistral 7B is a new open-source model with 7.3 billion parameters

Mistral 7B outperforms Llama 2 13B and Llama 1 34B on many benchmarks

Mistral 7B approaches CodeLlama 7B's performance on code while remaining good at English tasks

Mistral 7B uses grouped-query attention for faster inference

Mistral 7B outperforms comparable models on benchmarks like MMLU and knowledge/reasoning

At 60% on MMLU, Mistral 7B performs on par with a 23B-parameter Llama 2

At 69% on reasoning, Mistral 7B performs on par with a 38B-parameter Llama 2

Mistral 7B shows up to 5.4x better performance efficiency than Llama 2

Mistral 7B uses sliding window attention for faster inference

Sliding window attention exploits stacked transformer layers to attend beyond the window size

A rolling buffer cache saves half the cache memory at sequence length 8,192

Mistral 7B is the result of 3 months of intense work

People found Mistral 7B unsafe, as it readily responds to malicious instructions

The Mistral AI team linked a torrent for the model files

Mistral 7B raises concerns around AI safety given its capabilities

Transcripts

00:00

So there's been a lot of news recently: there's a new open-source model out called Mistral 7B. It's the best 7-billion-parameter model to date, as it says here, and the license is Apache 2.0, so developers can modify the code and use it as they would like. What's interesting as well is that the company, Mistral AI, was actually criticized a few months back by a lot of people in the community for raising a humongous amount of money. As you can see in this article, they had a $113 million seed round, and a lot of people were wondering what on earth they were going to do with that money. I guess now we're finding out.

00:55

So let's see what exactly is happening here. Mistral 7B is a 7.3-billion-parameter model. It outperforms Llama 2 13B and Llama 1 34B on many benchmarks, and amazingly it even approaches CodeLlama 7B performance on code while remaining good at English tasks, so it's even more generalizable than CodeLlama. They say it uses grouped-query attention for faster inference and also sliding window attention; we'll get to that a bit later. But honestly, the performance here is a bit ridiculous.

01:34

Here we get into the performance details: MMLU, knowledge, reasoning, and comprehension. They compare it to three other models: Llama 2 7B, Llama 2 13B, and Llama 1 34B. Mistral 7B comes out on top, in parity with Llama 1 34B on reasoning, and on all the other benchmarking tasks it comes out as number one, which is just so amazing. I mean, Llama 2 only came out recently, and Mistral AI already has this 7B model that is on par with a 13B model on things like knowledge, reasoning, comprehension, and math. The level of advancement here is so surprising.

02:26

Here we can see how the benchmarks are categorized; I won't go too much into this, but you can see which ones are zero-shot, five-shot, eight-shot, and so on. Then we get into even more benchmarks, and it comes out on top on so many of these. Remember, it's only half the size of Llama 2 13B. It comes out on top on MMLU, and then on HumanEval, CodeLlama 7B is pretty much on top — it outperforms Mistral there, and also on the MBPP test — but on math Mistral does better.

03:19

So here they break it down for us: the performance-to-cost efficiency ratio, where you relate performance to model parameter size. At 60% on MMLU, the 7B model is competing with a Llama 2 that would be 23 billion parameters at that point, and at 69% on reasoning, Mistral is competing with a Llama 2 at 38 billion parameters. That's a 5.4x performance efficiency, which is honestly so surprising.

04:11

You think it's surprising, but now let's get to how they did it. They have a section here called "Flash and Furious" — very nice — so obviously they're giving a hint to FlashAttention, which was spearheaded by Tri Dao. FlashAttention is all about finding speed-ups and efficiencies through block sizes. Here it says Mistral 7B uses a sliding window attention (SWA) mechanism in which each layer attends to the previous 4,096 hidden states.

04:59

This is one of the papers they linked: the Longformer paper, the long-document Transformer paper. Here is the idea: we have the full n² attention, the dilated sliding window, and the global sliding window, and you can see the sliding window attention pattern. Again, it's a blockwise attention mechanism: it goes across the blocks in order to generalize better and approximate full attention more efficiently compared to the other patterns. This one is so much more efficient (let me zoom in) in how it models the data and generalizes to it. There are other techniques, but in this paper this is the pattern they're using for their blockwise attention.

06:13

Continuing on, they say that in practice, changes made to FlashAttention and xFormers yield a 2x speed improvement for a sequence length of 16k with a window of 4k, and they give a shout-out to Tri Dao and Daniel Haziza. The sliding window attention exploits the stacked layers of a transformer to attend to the past beyond the window size: a token at position i in layer k attends to the tokens between position i minus the sliding window and position i at layer k minus one. And finally, a fixed attention span means we can limit our cache to a size of sliding-window tokens using rotating buffers; this saves half the cache memory for inference on a sequence length of 8,192.

07:20

And again, here they get into how else they achieved this: they did some fine-tuning for Mistral 7B Instruct, and they're throwing some punches here — they fine-tuned it on instruction datasets publicly available on Hugging Face, and there are no tricks, no proprietary data used. So again, very impressive compared to a lot of these other models, which have a tendency to use proprietary data.

07:58

OK, so let's go to the GitHub page, where we get a better grasp of how this thing works, because again it's phenomenal. Of course there's the installation — there are many videos out there that teach you how to download and install it — but let's get to the sliding window attention. Here's a better idea of how it works. First we have vanilla attention: this attention mechanism ensures that the model is causal, so it can only use information from the past to predict the future. Now if we go to the sliding window, it notes that tokens outside the sliding window still influence next-word prediction: at each layer, information can move forward by at most W tokens, so after two attention layers it has moved up to 2W. And then this is the rolling buffer cache they're using. But again, the biggest thing here is the sliding window attention that allows them to get such amazing efficiencies.

09:12

Another very important question is how long they trained it for, because that's what I was wondering: when did they train this thing and how long did it take? We did get a bit of detail here, where under "Mistral AI's first steps" they say this is the result of 3 months of intense work, in which they assembled the Mistral AI team and rebuilt a top-performance MLops stack. We'll see how this transforms the open-source landscape.

09:46

Honestly, it's already transforming it in a way, because let's go on X (Twitter) and see what exactly people have said. Here we have a tweet where someone says that after spending just 20 minutes with the Mistral AI 7B model, they're shocked at how unsafe it is, and that it's very rare these days to see a new model reply so readily to even the most malicious instructions. I did check out what they were saying — they shared a spreadsheet — and I also tried it myself: you can ask very, very sensitive things and it will give you a very direct answer on how to do it. You can use your imagination about what you can ask, and it will give you an answer. A lot of the time the answer isn't as advanced or effective as one might think — again, these are just LLMs — but it still gives you an answer.

10:46

Another cool nugget is that when they released this thing, they actually linked a torrent magnet, which is quite funny. Unbelievable stuff from this team.