Mistral 7B - The New 7B LLaMA Killer?

Sam Witteveen
28 Sept 2023 · 09:36

Summary

TL;DR: Mistral AI is a new AI startup that recently released Mistral 7B, an open-source 7-billion-parameter model optimized for low latency that outperforms models twice its size. The model uses techniques like grouped-query attention and sliding window attention. Benchmarks show Mistral 7B exceeds the performance of other popular 7B models on metrics like MMLU and GSM-8K. In hands-on testing, some responses are excellent while others are mediocre. Overall it is a promising model, especially once fine-tuned, though it proved inconsistent on GSM-8K-style question answering. Given its commitment to open-sourcing, Mistral may release even better models.

Takeaways

  • 😲 Mistral AI raised $113M in seed funding from top investors like Eric Schmidt and Lightspeed.
  • 📈 The Mistral 7B model outperforms larger models like LLaMA-2 13B and even LLaMA-1 34B on benchmarks.
  • 🌟 Mistral is focused on low latency, summarization, completion, and code capabilities.
  • 🔎 The model uses grouped-query attention and sliding window attention.
  • 👍 Mistral is committed to open sourcing high quality models.
  • 🤔 The Instruct model's prompt formatting is important for good responses.
  • ✨ Performance is great on some tasks like analogies and story writing.
  • 😕 It struggles a bit on GSM-8K and factoid questions, though.
  • 📦 4-bit quantized versions will be very small and mobile friendly.
  • ⭐ Overall it's worth trying out, especially once fine-tuned versions emerge.

Q & A

  • What is the model size and key features of Mistral 7B?

    -Mistral 7B is a 7 billion parameter model. Key features include support for English and code, low latency, and optimizations for text summarization, completion, and code completion.

  • How does Mistral 7B compare to models like LLaMA-2 in terms of performance?

    -The blog post from Mistral shows that Mistral 7B outperforms LLaMA-2 13B, a model almost twice its size, on metrics like MMLU. It also scores much higher on AGIEval than the LLaMA-2 models and LLaMA-1 34B.

  • What techniques does Mistral 7B use to achieve better performance?

    -The model uses grouped-query attention and sliding window attention. The sliding window attends to roughly 4,000 tokens at a time, which lets the model handle contexts of up to 8,000 tokens efficiently.
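
    As a rough illustration of the idea (a minimal sketch, not Mistral's actual implementation), a sliding-window attention mask limits each query token to the most recent W key positions, where an ordinary causal mask would allow the entire prefix:

        import torch

        def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
            # True means "query position i may attend to key position j".
            # A plain causal mask is just (j <= i); the sliding window
            # additionally drops keys more than `window` positions back.
            i = torch.arange(seq_len).unsqueeze(1)  # query positions
            j = torch.arange(seq_len).unsqueeze(0)  # key positions
            return (j <= i) & (j > i - window)

        print(sliding_window_mask(seq_len=6, window=3).int())
        # Information from tokens outside the window still propagates
        # indirectly through the stacked layers.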

  • Why is Mistral an important new player in AI?

    -Until now, we have largely had to rely on Meta for high-quality open foundation models with good licenses. Mistral seems committed to open-sourcing high-quality models too, which introduces more competition and opportunities for better foundation models.

  • How well does Mistral 7B perform on complex reasoning tasks?

    -On the published GSM-8k benchmark results, Mistral 7B correctly answers 52% of the questions, much higher than comparable 7B models. In the video's hands-on tests, though, its answers to GSM-8k-style questions were hit or miss.
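
    For a feel of the benchmark, the widely cited cafeteria item illustrates the style of question (the arithmetic check below is ours, not the model's output):

        question = ("The cafeteria had 23 apples. If they used 20 to make lunch "
                    "and bought 6 more, how many apples do they have?")
        # Correct chain of reasoning: 23 - 20 = 3 remaining, then 3 + 6 = 9.
        assert 23 - 20 + 6 == 9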

  • Does Mistral 7B support an instruct format?

    -Yes, there is a separate Mistral 7B Instruct model available. It uses an instruction-tag wrapper format for guiding the model.
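
    A minimal sketch of that wrapper format, based on the tags shown on the model card (the helper name is illustrative):

        def wrap_instruction(instruction: str) -> str:
            # Mistral 7B Instruct expects instructions wrapped in [INST] tags.
            # Most tokenizers prepend the <s> (BOS) token automatically, and
            # the model marks the end of its reply with an </s> token.
            return f"[INST] {instruction} [/INST]"

        prompt = wrap_instruction("Write a short email to Sam Altman.")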

  • What are some strengths and weaknesses noticed from testing Mistral 7B?

    -Strengths include good performance on analogies, email writing, and chat. Weaknesses include inconsistent performance on factoid questions and hit-or-miss results on GSM-8k reasoning questions.

  • How does Mistral 7B compare to models like Anthropic's Claude?

    -It is hard to compare directly, as Anthropic hasn't shared detailed benchmarks for Claude. But Mistral open-sourcing high-quality models introduces healthy competition, which should spur continued progress.

  • Will Mistral 7B be easy to deploy on different hardware platforms?

    -Yes, 4-bit quantized versions of the model should allow it to be easily deployed on smartphones and other consumer devices with limited GPU memory.
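
    A sketch of what that deployment path might look like with Hugging Face Transformers and bitsandbytes (the Hub id and config values are assumptions, not from the video):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

        model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed Hub id

        # 4-bit quantization shrinks the weights to a few GB, small enough
        # for consumer GPUs (and eventually phones) with limited memory.
        quant_config = BitsAndBytesConfig(load_in_4bit=True,
                                          bnb_4bit_compute_dtype=torch.float16)

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id,
                                                     quantization_config=quant_config,
                                                     device_map="auto")  # needs accelerate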

  • What are some good next steps for experimenting with Mistral 7B?

    -Try fine-tuning the model, integrating a system prompt, playing with smaller quantized versions, and evaluating performance on specific use cases compared to other models.
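
    For instance, since there is no dedicated system-prompt slot, one simple experiment is to prepend system text inside the first instruction tag (this convention is our assumption, not an official one):

        def wrap_with_system(system: str, instruction: str) -> str:
            # Prepend the system text inside the first [INST] block and see
            # whether it influences the responses.
            return f"[INST] {system}\n\n{instruction} [/INST]"

        prompt = wrap_with_system(
            "You are a concise assistant that answers in one sentence.",
            "What is the capital of England?")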

Outlines

00:00

😊 Introducing Mistral AI and its new 7B model

This paragraph introduces Mistral AI, a new AI company formed by researchers from DeepMind and Meta. It raised $113 million, mostly to buy GPUs and build models. They have now released Mistral 7B, a 7-billion-parameter model that outperforms larger models like LLaMA-2 13B. It supports text completion, summarization, and code completion. Mistral seems committed to open-sourcing high-quality models.

05:00

👨‍💻 Testing out the Mistral 7B Instruct model

This paragraph shows example code for using the Mistral 7B Instruct model. It uses a special prompt format to denote instructions and responses. The model is tested on analogies, email writing, capital-city questions, conversations between people, story generation, and more. Performance seems good but inconsistent across runs. GSM-8K performance seems weaker. Overall it's a promising model worth trying out, especially the smaller 4-bit and 8-bit quantized versions.
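
A minimal sketch of the kind of generate() helper the video describes (names, defaults, and the Hub id are illustrative, not the video's exact notebook):

    # pip install git+https://github.com/huggingface/transformers
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mistral-7B-Instruct-v0.1"  # assumed Hub id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")

    def generate(instruction: str, max_new_tokens: int = 256) -> str:
        # Wrap the instruction in [INST] tags, tokenize, generate, decode.
        prompt = f"[INST] {instruction} [/INST]"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens,
                                 do_sample=True)
        return tokenizer.decode(outputs[0], skip_special_tokens=True)

    print(generate("Write a detailed analogy between mathematics and music."))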

Keywords

💡Mistral AI

Mistral AI is the name of the new AI startup that recently raised $113 million in seed funding. They released their first model, Mistral 7B, which seems to significantly outperform other models of similar size, like the LLaMA 7B models. The video analyzes this new model and company.

💡seed funding

Mistral AI raised $113 million in seed funding from famous investors to help launch the company. This large amount for a new startup attracted attention and signaled the potential capability of their models.

💡Mistral 7B model

The Mistral 7B model is the first model released by Mistral AI. It has 7 billion parameters but seems to outperform much larger models on benchmarks. It has an 'Instruct' fine-tuned version for more robust instruction-following.

💡LLaMA models

LLaMA models refer to large language models released by Meta/Facebook previously. Mistral 7B outperforms equivalent sized LLaMA models and even some much larger ones, showing the capability of Mistral's approach.

💡benchmarks

Various benchmarks like MMLU and GSM-8k are used to quantitatively measure the performance of large language models on tasks like question answering. Mistral 7B scores highly on several benchmarks.

💡code generation

One capability highlighted for Mistral 7B is code generation and completion, which may have contributed to its high AGIEval scores. The video tests code-generation queries with the model.
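
For reference, the prime-number query the video mentions would typically yield something like this (a typical correct completion, not the model's verbatim output):

    def is_prime(n: int) -> bool:
        # Check divisibility only up to the square root of n.
        if n < 2:
            return False
        for d in range(2, int(n ** 0.5) + 1):
            if n % d == 0:
                return False
        return True

    print([x for x in range(20) if is_prime(x)])  # [2, 3, 5, 7, 11, 13, 17, 19]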

💡sliding window attention

A technique used in Mistral 7B, along with grouped-query attention, to make attention over long contexts efficient (a roughly 4,000-token window within the 8,000-token context). This likely helps with summarization and completion tasks.

💡model comparison

The video does hands-on testing of Mistral 7B, reusing past demo scripts from models like Phi-1.5, and compares performance on basic tasks.

💡chat

The video finds that Mistral 7B does a decent job at open-ended chat, though it can be inconsistent between runs; an area for further improvement.

💡question answering

Performance on question answering seems mixed, with some good factual answers but also some incorrect or hallucinated responses, indicating room for tuning on robustness.

Highlights

Mistral AI raised $113 million in seed funding from top investors like Eric Schmidt and Lightspeed.

Mistral 7B outperforms larger models like LLaMA-2 13B despite being almost half the size.

Mistral 7B uses grouped-query attention and sliding window attention for better performance.

The Mistral 7B Instruct model beats all other 7B models and most 13B models on benchmarks.

Mistral is committed to open sourcing high quality models which opens new opportunities.

You need to install the latest Transformers from GitHub and use the special prompt formatting for instructions.

The model does well on code generation and analogies but struggles on GSM-8K math word problems.

Performance is snappy even for a 7B model; it runs well on an A100 and likely fits on a T4 GPU, especially quantized.

Writing emails and chats works well but QA can be hit or miss.

No system prompt needed, just use instruction prompt wrapping.

Likely smaller 4-bit quantized versions will enable mobile and edge usage.

Model is worth trying out, will likely improve with fine-tuning.

Let me know in comments if you have any other questions!

Transcripts

play00:00

Okay.

play00:00

So Mistral AI is a company that sort of burst onto the scene in late May and

play00:05

early June when they raised around $113 million for their seed round.

play00:11

And at the time, people were quite vocal about it: how could an unknown company

play00:17

suddenly raise so much money and not only that, it had a lot of famous investors,

play00:22

people like Eric Schmidt, and Lightspeed, the VC firm behind the mobile app Snap,

play00:28

and a variety of other VCs as well.

play00:30

And what it turned out to be at the time was that, basically, this was a group of

play00:34

researchers from DeepMind and from Meta,

play00:38

and they were getting together to basically build a new AI company.

play00:42

And the reason for the large raise was mostly to go towards

play00:47

buying GPUs apparently.

play00:49

Well, jump ahead a few months and we've now got the first model that they've

play00:54

actually released and this is Mistral 7B.

play00:57

So it's a small model compared to others.

play01:00

But it's very much punching above its weight.

play01:02

So overall, to sum up the model: it's a 7-billion-parameter model.

play01:06

There are two versions of this.

play01:08

There is one that is basically just the base model, and there's one that

play01:11

is an instruct fine-tuned model.

play01:13

If we come down and have a look at this, we can see that the

play01:15

model supports English and code.

play01:18

And it goes out to an 8K context-length window here.

play01:21

The license is Apache 2.0.

play01:24

And the model's been optimized for low latency, text summarization, text

play01:28

completion and code completion here.

play01:31

They've released a blog post as well as actually releasing

play01:34

the model on Hugging Face.

play01:36

So before we jump in and have a look at the model itself, just

play01:38

quickly looking at their blog post.

play01:40

you can see "Mistral 7B in short", where they're claiming that this

play01:44

outperforms LLaMA-2 13 billion.

play01:47

So that's a model almost twice as big as the 7 billion, and it

play01:51

also outperforms LLaMA-1 34 billion.

play01:54

So you can see here, the performance is definitely a lot better than

play01:59

the LLaMA-2 models for this.

play02:01

Now, it does seem that the model in many ways is similar to the

play02:04

LLaMA-2 models with the amount of tokens and the sizing, et cetera.

play02:09

But it does seem that they've found a way to squeeze out a lot more

play02:13

performance for that particular size.

play02:16

So in the blog post, they mention that they're using grouped-query attention.

play02:20

They're using sliding window attention,

play02:23

and they also publish some stats that we can have a look at here.

play02:26

So you can see here, they've basically got a graph of the performance in detail

play02:31

for the different benchmarks, comparing the Mistral 7B to the LLaMA-2 7B, the

play02:37

LLaMA-2 13B and also the LLaMA-1 34B here.

play02:42

So based on these graphs, we can see that it's doing very well with the MMLU scores.

play02:46

Apparently the model can do both English and code, and perhaps the ability to do

play02:52

code has helped it very much with the AGIEval scores, where it seems to be

play02:57

scoring much higher than the two LLaMA-2 models and the LLaMA-1 34B model there.

play03:04

Another metric that I think is really interesting here is how well it does on

play03:07

the GSM 8K benchmark, which I've talked about in some of the other videos here.

play03:11

And you can see that here it's getting 52%, far above the

play03:16

LLaMA-2 7B and the LLaMA-2 13B here,

play03:20

and also far above the fine-tuned CodeLlama 7B here, which

play03:25

is very interesting to look at.

play03:27

So in here, they've got a little bit about the sliding window attention and how that

play03:31

basically attends to the previous 4,000 tokens.

play03:34

We can also see that they've actually released a chat model for this,

play03:38

or an instruct model for this.

play03:40

So they're calling this the Mistral 7B Instruct model.

play03:43

And they show that not only is this beating all the other 7B models, it's

play03:47

actually doing better than a lot of the 13B models, with only perhaps WizardLM

play03:53

13B and Vicuna 13B beating it here.

play03:57

And one of the good things with this is it seems that Mistral is definitely

play04:00

committed to open-sourcing models.

play04:02

Perhaps we're gonna see better and bigger models from them in the future.

play04:07

So it is very nice; up until now, we've really had to rely

play04:10

on Meta releasing some of these foundation models with good licenses

play04:15

that are actually very high quality.

play04:17

It does seem now that there's another player on the scene, which opens up

play04:21

a whole bunch of new opportunities with other kinds of models as well.

play04:25

So let's jump into the code and have a look at how the

play04:28

Mistral 7B actually performs.

play04:31

Okay, so I'm going to go quickly through the Mistral 7B Instruct here.

play04:37

One of the key things you want to make sure of is that you install

play04:39

Hugging Face Transformers from GitHub,

play04:41

to make sure that you've got the latest version there.

play04:44

And once you've got that, you can bring in the model and the tokenizer,

play04:48

just like before.

play04:50

And if we look on the Hugging Face Hub here, we can see

play04:54

their instructions for doing it,

play04:57

including the instruction

play04:58

format that they're using.

play05:00

So that's going to be key in here as well.

play05:03

Okay.

play05:03

So the prompt format that they're using basically is you wrap

play05:07

things in this instruction tag.

play05:10

If there is an assistant response, you will then basically get an

play05:15

end-of-response tag back or an end-of-text tag back, like that.

play05:20

So I've just put together a very simple little generate function that basically

play05:25

wraps our instructions in this way.

play05:28

It takes those, puts them through a tokenizer,

play05:31

encodes them,

play05:32

and puts them on the device here.

play05:34

So I kind of reused the Phi-1.5

play05:38

notebook that I had recently.

play05:39

So there were a number of things in that that were code-gen tasks.

play05:42

So I thought I'd start off with that.

play05:44

And it seems like, okay, it's doing

play05:46

some interesting code gen in here. For this, it does

play05:51

generate functions pretty well,

play05:53

for checking prime numbers, et cetera.

play05:56

Though running through them at times, some are hits, some are misses.

play06:00

And that's generally how I found the responses overall: some of them

play06:05

are really good, but then if you rerun it, you can get a very so-so response.

play06:09

quite often as well.

play06:11

So you can see here, I've asked it some of the things from the Phi-1.5 notebook:

play06:17

write

play06:18

a detailed analogy between mathematics and music,

play06:21

and it does quite nicely at that, though it's running out of tokens

play06:24

at the end.

play06:25

But it's definitely snappy performance.

play06:27

So I'm running this on an A100.

play06:30

Because they recommended that you use at least 24

play06:33

GB of RAM,

play06:35

but I think it would actually fit probably on the T4 as well.

play06:40

And certainly it will fit on the T4 as an 8-bit or 4-bit version

play06:45

here.

play06:46

Okay.

play06:46

So, some standard questions that I ask

play06:49

normally, like the LLaMA, Vicuna, Alpaca ones:

play06:52

it does quite nicely with this at times, but then also certain

play06:55

generations didn't do as well as this.

play06:58

The "write an email to Sam Altman" one:

play07:00

I thought this one generally came out pretty good.

play07:03

You want to make sure you give it some extra tokens;

play07:06

it seems to want to actually use those tokens for something like this.

play07:11

As we're going through now, questions like "What is the capital of England?"

play07:15

I found to be a little bit hit and miss: sometimes it would just

play07:18

give you a very succinct answer.

play07:20

Sometimes it would give a very long answer.

play07:22

Questions like "Can Geoffrey Hinton have a conversation with George Washington?

play07:27

Give rationale before answering."

play07:29

This kind of question it actually seems to handle quite well, and actually

play07:32

probably better than a lot of the other 7B models out there. Also, for

play07:38

making up stories,

play07:39

this seemed to be quite good as well.

play07:42

Chat: it seems to do

play07:43

quite well at completing chats.

play07:45

What I did find it to be lacking in

play07:48

is the GSM-8K stuff.

play07:50

So even just the simple cafeteria question.

play07:55

And my guess is that, okay,

play07:57

I think in the stats they were saying that this model is getting 52%,

play08:02

right?

play08:03

So certainly the ones I've given it, it seems to get wrong.

play08:07

Well, it did get this one right at times.

play08:11

So I found that sometimes it got it right,

play08:13

sometimes it got it wrong.

play08:14

The times I ran this one and it got it wrong, even though it works out that

play08:18

you've got three plus six, which is great.

play08:22

But then three plus six doesn't equal 29.

play08:25

So it's sort of off base on some of those.

play08:28

Overall, I'd say the model is certainly worth giving a shot

play08:31

and having a play with it.

play08:33

I suspect that we may get some really good fine-tunes of this model

play08:38

once people sort of work out how to tune it and stuff.

play08:41

I also found it kind of interesting that it's not using a system prompt at all.

play08:44

It's just basically using this instruction prompt that goes in here.

play08:48

So originally I had my code for a system prompt.

play08:51

I've taken that out

play08:52

as we've gone through, but you could play with putting a system

play08:55

prompt at the start and see, okay:

play08:57

does that influence it in any way?

play09:00

Anyway, overall, have a play with the model yourself.

play09:02

See what you think of it.

play09:04

My guess is that the 4-bit versions of this are going to be very small

play09:09

and be able to easily run on phones and other devices, which makes it a

play09:14

very appealing model for a variety of different tasks for this kind of thing.

play09:19

Anyway, as always, if you've got anything

play09:22

to say or any questions, please put them in the comments below.

play09:25

If you're interested in videos about large language models, I've

play09:28

got a bunch of these coming up.

play09:30

so please click like and subscribe.

play09:32

I will talk to you in the next video.

play09:33

Bye for now.