Mistral 7B - The New 7B LLaMA Killer?

Sam Witteveen
28 Sept 2023 · 09:36

Summary

TL;DR: Mistral AI is a new AI startup that recently released Mistral 7B, an open-source 7-billion-parameter model optimized for low latency that outperforms models twice its size. The model uses techniques like grouped-query attention and sliding window attention. Benchmarks show Mistral 7B exceeds the performance of other popular 7B models on metrics like MMLU and GSM-8K. In hands-on testing, some responses are excellent while others are mediocre. Overall it is a promising model, especially once fine-tuned, though it proved inconsistent on GSM-8K-style question answering. Given its commitment to open-sourcing, Mistral may release even better models.

Takeaways

  • 😲 Mistral AI raised $113M in seed funding from top investors like Eric Schmidt and Lightspeed.
  • 📈 The Mistral 7B model outperforms larger models like LLaMA-2 13B and even LLaMA-1 34B on benchmarks.
  • 🌟 Mistral is focused on low latency, summarization, completion, and code capabilities.
  • 🔎 The model uses grouped-query attention and sliding window attention.
  • 👍 Mistral is committed to open sourcing high quality models.
  • 🤔 The Instruct model's prompt formatting is important for good responses.
  • ✨ Performance is great on some tasks like analogies and story writing.
  • 😕 It struggles a bit on GSM-8K and factoid questions, though.
  • 📦 4-bit quantized versions will be very small and mobile friendly.
  • ⭐ Overall it's worth trying out, especially once fine-tuned versions emerge.

Q & A

  • What is the model size and key features of Mistral 7B?

    -Mistral 7B is a 7 billion parameter model. Key features include support for English and code, low latency, and optimizations for text summarization, completion, and code completion.

  • How does Mistral 7B compare to models like LLaMA-2 in terms of performance?

    -The blog post from Mistral shows that Mistral 7B outperforms LLaMA-2 13B, a model almost twice its size, on metrics like MMLU. It also scores much higher on AGIEval than the LLaMA-2 models and LLaMA-1 34B.

  • What techniques does Mistral 7B use to achieve better performance?

    -The model uses grouped-query attention and sliding window attention. The sliding window attends to roughly 4,000 tokens at a time, which lets the model handle contexts of up to 8,000 tokens efficiently.
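
    As a rough illustration of the idea (a minimal sketch, not Mistral's actual implementation), a sliding-window attention mask limits each query token to the most recent W key positions, where an ordinary causal mask would allow the entire prefix:

        import torch

        def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
            # True means "query position i may attend to key position j".
            # A plain causal mask is just (j <= i); the sliding window
            # additionally drops keys more than `window` positions back.
            i = torch.arange(seq_len).unsqueeze(1)  # query positions
            j = torch.arange(seq_len).unsqueeze(0)  # key positions
            return (j <= i) & (j > i - window)

        print(sliding_window_mask(seq_len=6, window=3).int())
        # Information from tokens outside the window still propagates
        # indirectly through the stacked layers.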

  • Why is Mistral an important new player in AI?

    -Until now, we have largely had to rely on Meta for high-quality open foundation models with good licenses. Mistral seems committed to open-sourcing high-quality models too, which introduces more competition and opportunities for better foundation models.

  • How well does Mistral 7B perform on complex reasoning tasks?

    -On the published GSM-8k benchmark results, Mistral 7B correctly answers 52% of the questions, much higher than comparable 7B models. In the video's hands-on tests, though, its answers to GSM-8k-style questions were hit or miss.
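
    For a feel of the benchmark, the widely cited cafeteria item illustrates the style of question (the arithmetic check below is ours, not the model's output):

        question = ("The cafeteria had 23 apples. If they used 20 to make lunch "
                    "and bought 6 more, how many apples do they have?")
        # Correct chain of reasoning: 23 - 20 = 3 remaining, then 3 + 6 = 9.
        assert 23 - 20 + 6 == 9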

  • Does Mistral 7B support an instruct format?

    -Yes, there is a separate Mistral 7B Instruct model available. It uses an instruction-tag wrapper format for guiding the model.
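
    A minimal sketch of that wrapper format, based on the tags shown on the model card (the helper name is illustrative):

        def wrap_instruction(instruction: str) -> str:
            # Mistral 7B Instruct expects instructions wrapped in [INST] tags.
            # Most tokenizers prepend the <s> (BOS) token automatically, and
            # the model marks the end of its reply with an </s> token.
            return f"[INST] {instruction} [/INST]"

        prompt = wrap_instruction("Write a short email to Sam Altman.")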

  • What are some strengths and weaknesses noticed from testing Mistral 7B?

    -Strengths include good performance on analogies, email writing, and chat. Weaknesses include inconsistent performance on factoid questions and hit-or-miss results on GSM-8k reasoning questions.

  • How does Mistral 7B compare to models like Anthropic's Claude?

    -It is hard to compare directly, as Anthropic hasn't shared detailed benchmarks for Claude. But Mistral open-sourcing high-quality models introduces healthy competition, which should spur continued progress.

  • Will Mistral 7B be easy to deploy on different hardware platforms?

    -Yes, 4-bit quantized versions of the model should allow it to be easily deployed on smartphones and other consumer devices with limited GPU memory.
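
    A sketch of what that deployment path might look like with Hugging Face Transformers and bitsandbytes (the Hub id and config values are assumptions, not from the video):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

        model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed Hub id

        # 4-bit quantization shrinks the weights to a few GB, small enough
        # for consumer GPUs (and eventually phones) with limited memory.
        quant_config = BitsAndBytesConfig(load_in_4bit=True,
                                          bnb_4bit_compute_dtype=torch.float16)

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id,
                                                     quantization_config=quant_config,
                                                     device_map="auto")  # needs accelerate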

  • What are some good next steps for experimenting with Mistral 7B?

    -Try fine-tuning the model, integrating a system prompt, playing with smaller quantized versions, and evaluating performance on specific use cases compared to other models.
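
    For instance, since there is no dedicated system-prompt slot, one simple experiment is to prepend system text inside the first instruction tag (this convention is our assumption, not an official one):

        def wrap_with_system(system: str, instruction: str) -> str:
            # Prepend the system text inside the first [INST] block and see
            # whether it influences the responses.
            return f"[INST] {system}\n\n{instruction} [/INST]"

        prompt = wrap_with_system(
            "You are a concise assistant that answers in one sentence.",
            "What is the capital of England?")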

Outlines

00:00

😊 Introducing Mistral AI and its new 7B model

This paragraph introduces Mistral AI, a new AI company formed by researchers from DeepMind and Meta. It raised $113 million, mostly to buy GPUs and build models. They have now released Mistral 7B, a 7-billion-parameter model that outperforms larger models like LLaMA-2 13B. It supports text completion, summarization, and code completion. Mistral seems committed to open-sourcing high-quality models.

05:00

👨‍💻 Testing out the Mistral 7B Instruct model

This paragraph shows example code for using the Mistral 7B Instruct model. It uses a special prompt format to denote instructions and responses. The model is tested on analogies, email writing, capital-city questions, conversations between people, story generation, and more. Performance seems good but inconsistent across runs. GSM-8K performance seems weaker. Overall it's a promising model worth trying out, especially the smaller 4-bit and 8-bit quantized versions.
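
A minimal sketch of the kind of generate() helper the video describes (names, defaults, and the Hub id are illustrative, not the video's exact notebook):

    # pip install git+https://github.com/huggingface/transformers
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed Hub id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")

    def generate(instruction: str, max_new_tokens: int = 256) -> str:
        # Wrap the instruction in [INST] tags, tokenize, generate, decode.
        prompt = f"[INST] {instruction} [/INST]"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 do_sample=True)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(generate("Write a detailed analogy between mathematics and music."))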

Keywords

💡Mistral AI

Mistral AI is the name of the new AI startup that recently raised $113 million in seed funding. They released their first model, Mistral 7B, which seems to significantly outperform other models of similar size, like the LLaMA 7B models. The video analyzes this new model and company.

💡seed funding

Mistral AI raised $113 million in seed funding from famous investors to help launch the company. This large amount for a new startup attracted attention and signaled the potential capability of their models.

💡Mistral 7B model

The Mistral 7B model is the first model released by Mistral AI. It has 7 billion parameters but seems to outperform much larger models on benchmarks. It has an 'Instruct' fine-tuned version for more robust instruction-following.

💡LLaMA models

LLaMA models refer to large language models released by Meta/Facebook previously. Mistral 7B outperforms equivalent sized LLaMA models and even some much larger ones, showing the capability of Mistral's approach.

💡benchmarks

Various benchmarks like MMLU and GSM-8k are used to quantitatively measure the performance of large language models on tasks like question answering. Mistral 7B scores highly on several benchmarks.

💡code generation

One capability highlighted for Mistral 7B is code generation and completion, which may have contributed to its high AGIEval scores. The video tests code-generation queries with the model.
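
For reference, the prime-number query the video mentions would typically yield something like this (a typical correct completion, not the model's verbatim output):

    def is_prime(n: int) -> bool:
        # Check divisibility only up to the square root of n.
        if n < 2:
            return False
        for d in range(2, int(n ** 0.5) + 1):
            if n % d == 0:
                return False
        return True

    print([x for x in range(20) if is_prime(x)])  # [2, 3, 5, 7, 11, 13, 17, 19]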

💡sliding window attention

A technique used in Mistral 7B, along with grouped-query attention, to make attention over long contexts efficient (a roughly 4,000-token window within the 8,000-token context). This likely helps with summarization and completion tasks.

💡model comparison

The video does hands-on testing of Mistral 7B, reusing past demo scripts from models like Phi-1.5, and compares performance on basic tasks.

💡chat

The video finds that Mistral 7B does a decent job at open-ended chat, though it can be inconsistent between runs; an area for further improvement.

💡question answering

Performance on question answering seems mixed, with some good factual answers but also some incorrect or hallucinated responses, indicating room for tuning on robustness.

Highlights

Mistral AI raised $113 million in seed funding from top investors like Eric Schmidt and Lightspeed.

Mistral 7B outperforms larger models like LLaMA-2 13B despite being almost half the size.

Mistral 7B uses grouped-query attention and sliding window attention for better performance.

The Mistral 7B Instruct model beats all other 7B models and most 13B models on benchmarks.

Mistral is committed to open sourcing high quality models which opens new opportunities.

You need to install the latest Transformers from GitHub and use the special prompt formatting for instructions.

The model does well on code generation and analogies but struggles on GSM-8K math word problems.

Performance is snappy even for a 7B model; it runs well on an A100 and likely fits on a T4 GPU, especially quantized.

Writing emails and chats works well but QA can be hit or miss.

No system prompt needed, just use instruction prompt wrapping.

Likely smaller 4-bit quantized versions will enable mobile and edge usage.

Model is worth trying out, will likely improve with fine-tuning.

Let me know in comments if you have any other questions!

Transcripts

play00:00

Okay.

play00:00

So Mistral AI is a company that sort of burst onto the scene in late May and

play00:05

early June when they raised around $113 million for their seed round.

play00:11

And at the time, people were quite vocal about it: how could an unknown company

play00:17

suddenly raise so much money and not only that, it had a lot of famous investors,

play00:22

people like Eric Schmidt, and Lightspeed, the VC firm behind the mobile app Snap,

play00:28

and a variety of other VCs as well.

play00:30

And what it turned out to be at the time was that, basically, this was a group of

play00:34

researchers from DeepMind and from Meta,

play00:38

and they were getting together to basically build a new AI company.

play00:42

And the reason for the large raise was mostly to go towards

play00:47

buying GPUs apparently.

play00:49

Well, jump ahead a few months and we've now got the first model that they've

play00:54

actually released and this is Mistral 7B.

play00:57

So it's a small model compared to others.

play01:00

But it's very much punching above its weight.

play01:02

So overall, to sum up the model: it's a 7-billion-parameter model.

play01:06

There are two versions of this.

play01:08

There is one that is basically just the base model, and there's one that

play01:11

is an instruct fine-tuned model.

play01:13

If we come down and have a look at this, we can see that the

play01:15

model supports English and code.

play01:18

And it goes out to an 8K context-length window here.

play01:21

The license is Apache 2.0.

play01:24

And the model's been optimized for low latency, text summarization, text

play01:28

completion and code completion here.

play01:31

They've released a blog post as well as actually releasing

play01:34

the model on Hugging Face.

play01:36

So before we jump in and have a look at the model itself, just

play01:38

quickly looking at their blog post.

play01:40

you can see "Mistral 7B in short", where they're claiming that this

play01:44

outperforms LLaMA-2 13 billion.

play01:47

So that's a model almost twice as big as the 7 billion, and it

play01:51

also outperforms LLaMA-1 34 billion.

play01:54

So you can see here, the performance is definitely a lot better than

play01:59

the LLaMA-2 models for this.

play02:01

Now, it does seem that the model in many ways is similar to the

play02:04

LLaMA-2 models with the amount of tokens and the sizing, et cetera.

play02:09

But it does seem that they've found a way to squeeze out a lot more

play02:13

performance for that particular size.

play02:16

So in the blog post, they mention that they're using grouped-query attention.

play02:20

They're using sliding window attention,

play02:23

and they also publish some stats that we can have a look at here.

play02:26

So you can see here, they've basically got a graph of the performance in detail

play02:31

for the different benchmarks, comparing the Mistral 7B to the LLaMA-2 7B, the

play02:37

LLaMA-2 13B and also the LLaMA-1 34B here.

play02:42

So based on these graphs, we can see that it's doing very well with the MMLU scores.

play02:46

Apparently the model can do both English and code, and perhaps the ability to do

play02:52

code has helped it very much with the AGIEval scores, where it seems to be

play02:57

scoring much higher than the two LLaMA-2 models and the LLaMA-1 34B model there.

play03:04

Another metric that I think is really interesting here is how well it does on

play03:07

the GSM 8K benchmark, which I've talked about in some of the other videos here.

play03:11

And you can see that here it's getting 52%, far above the

play03:16

LLaMA-2 7B and the LLaMA-2 13B here,

play03:20

and also far above the fine-tuned CodeLlama 7B here, which

play03:25

is very interesting to look at.

play03:27

So in here, they've got a little bit about the sliding window attention and how that

play03:31

basically attends to the previous 4,000 tokens.

play03:34

We can also see that they've actually released a chat model for this,

play03:38

or an instruct model for this.

play03:40

So they're calling this the Mistral 7B Instruct model.

play03:43

And they show that not only is this beating all the other 7B models, it's

play03:47

actually doing better than a lot of the 13B models, with only perhaps WizardLM

play03:53

13B and Vicuna 13B beating it here.

play03:57

And one of the good things with this is it seems that Mistral is definitely

play04:00

committed to open-sourcing models.

play04:02

Perhaps we're gonna see better and bigger models from them in the future.

play04:07

So it is very nice; up until now, we've really had to rely

play04:10

on Meta releasing some of these foundation models with good licenses

play04:15

that are actually very high quality.

play04:17

It does seem now that there's another player on the scene, which opens up

play04:21

a whole bunch of new opportunities with other kinds of models as well.

play04:25

So let's jump into the code and have a look at how the

play04:28

Mistral 7B actually performs.

play04:31

Okay, so I'm going to go quickly through the Mistral 7B Instruct here.

play04:37

One of the key things you want to make sure of is that you install

play04:39

Hugging Face Transformers from GitHub,

play04:41

to make sure that you've got the latest version there.

play04:44

And once you've got that, you can bring in the model and the tokenizer,

play04:48

just like before.

play04:50

And if we look on the Hugging Face Hub here, we can see

play04:54

their instructions for doing it,

play04:57

including the instruction

play04:58

format that they're using.

play05:00

So that's going to be key in here as well.

play05:03

Okay.

play05:03

So the prompt format that they're using basically is you wrap

play05:07

things in this instruction tag.

play05:10

If there is an assistant response, you will then basically get an

play05:15

end-of-response tag back or an end-of-text tag back, like that.

play05:20

So I've just put together a very simple little generate function that basically

play05:25

wraps our instructions in this way.

play05:28

It takes those, puts them through a tokenizer,

play05:31

encodes them,

play05:32

and puts them on the device here.

play05:34

So I kind of reused the Phi-1.5

play05:38

notebook that I had recently.

play05:39

So there were a number of things in that that were code-gen tasks.

play05:42

So I thought I'd start off with that.

play05:44

And it seems like, okay, it's doing

play05:46

some interesting code gen in here. For this, it does

play05:51

generate functions pretty well,

play05:53

for checking prime numbers, et cetera.

play05:56

Though running through them at times, some are hits, some are misses.

play06:00

And that's generally how I found the responses overall: some of them

play06:05

are really good, but then if you rerun it, you can get a very so-so response.

play06:09

quite often as well.

play06:11

So you can see here, I've asked it some of the things from the Phi-1.5 notebook:

play06:17

write

play06:18

a detailed analogy between mathematics and music,

play06:21

and it does quite nicely at that, though it's running out of tokens

play06:24

at the end.

play06:25

But it's definitely snappy performance.

play06:27

So I'm running this on an A100.

play06:30

Because they recommended that you use at least 24

play06:33

GB of RAM,

play06:35

but I think it would actually fit probably on the T4 as well.

play06:40

And certainly it will fit on the T4 as an 8-bit or 4-bit version

play06:45

here.

play06:46

Okay.

play06:46

So, some standard questions that I ask

play06:49

normally, like the LLaMA, Vicuna, Alpaca ones:

play06:52

it does quite nicely with this at times, but then also certain

play06:55

generations didn't do as well as this.

play06:58

The "write an email to Sam Altman" one:

play07:00

I thought this one generally came out pretty good.

play07:03

You want to make sure you give it some extra tokens;

play07:06

it seems to want to actually use those tokens for something like this.

play07:11

As we're going through now, questions like "What is the capital of England?"

play07:15

I found to be a little bit hit and miss: sometimes it would just

play07:18

give you a very succinct answer.

play07:20

Sometimes it would give a very long answer.

play07:22

Questions like "Can Geoffrey Hinton have a conversation with George Washington?

play07:27

Give rationale before answering."

play07:29

This kind of question it actually seems to handle quite well, and actually

play07:32

probably better than a lot of the other 7B models out there. Also, for

play07:38

making up stories,

play07:39

this seemed to be quite good as well.

play07:42

Chat: it seems to do

play07:43

quite well at completing chats.

play07:45

What I did find it to be lacking in

play07:48

is the GSM-8K stuff.

play07:50

So even just the simple cafeteria question.

play07:55

And my guess is that, okay,

play07:57

I think in the stats they were saying that this model is getting 52%,

play08:02

right?

play08:03

So certainly the ones I've given it, it seems to get wrong.

play08:07

Well, it did get this one right at times.

play08:11

So I found that sometimes it got it right,

play08:13

sometimes it got it wrong.

play08:14

The times I ran this one and it got it wrong, even though it works out that

play08:18

you've got three plus six, which is great.

play08:22

But then three plus six doesn't equal 29.

play08:25

So it's sort of off base on some of those.

play08:28

Overall, I'd say the model is certainly worth giving a shot

play08:31

and having a play with it.

play08:33

I suspect that we may get some really good fine-tunes of this model

play08:38

once people sort of work out how to tune it and stuff.

play08:41

I also found it kind of interesting that it's not using a system prompt at all.

play08:44

It's just basically using this instruction prompt that goes in here.

play08:48

So originally I had my code for a system prompt.

play08:51

I've taken that out

play08:52

as we've gone through, but you could play with putting a system

play08:55

prompt at the start and see, okay:

play08:57

does that influence it in any way?

play09:00

Anyway, overall, have a play with the model yourself.

play09:02

See what you think of it.

play09:04

My guess is that the 4-bit versions of this are going to be very small

play09:09

and be able to easily run on phones and other devices, which makes it a

play09:14

very appealing model for a variety of different tasks for this kind of thing.

play09:19

Anyway, as always, if you've got anything

play09:22

to say or any questions, please put them in the comments below.

play09:25

If you're interested in videos about large language models, I've

play09:28

got a bunch of these coming up.

play09:30

so please click like and subscribe.

play09:32

I will talk to you in the next video.

play09:33

Bye for now.