Based on DeepSeek R1. Is it Better?
Summary
TLDR: In this video, the speaker tests and evaluates the new OpenThinker models, which come in 32B and 7B parameter variants fine-tuned on a custom dataset, OpenThoughts. The evaluation covers tasks such as storytelling, problem-solving, and creative thinking, with an emphasis on performance and reasoning ability. The video includes detailed scoring based on criteria such as thinking time, output quality, and response length. Despite some inconsistencies, especially with the smaller 7B model, the OpenThinker 32B model performs well overall. The speaker discusses their testing methodology and invites feedback from viewers.
Takeaways
- 😀 OpenThinker models (32B and 7B) are fine-tuned versions of Qwen 2.5 models, trained on a custom dataset called OpenThoughts that was generated with DeepSeek R1.
- 😀 The models were evaluated based on a series of tests, focusing on creativity, logical coherence, correctness, response length, and time to answer.
- 😀 The 32B model performed well overall but was penalized for long thinking stages, impacting response speed and overall score.
- 😀 For the task of writing a story without using the letter 'e,' the 32B model took nearly 2 minutes to think but provided a correct response.
- 😀 The 7B model underperformed compared to the 32B model, showing noticeable delays in many tasks.
- 😀 When both models were evaluated at different quantizations (Q4, Q8, FP16), the performance gap between the 32B and 7B models narrowed, but the 32B model still came out ahead (see the loading sketch after this list).
- 😀 Thinking time was factored into the model evaluation, with penalties applied when the response took too long to begin (over 5 seconds).
- 😀 For the task of designing a new animal, the 32B model generated a unique and effective hybrid of a rat, pigeon, and owl, with appropriate survival strategies.
- 😀 The 32B model’s output was consistently high-quality, but some responses were lengthy due to the extended thinking phase, affecting final scores.
- 😀 The reviewer intentionally avoided prompt engineering and configuration tweaks to assess the models’ raw capabilities and to ensure fair evaluation.
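As a concrete illustration of the quantization comparison mentioned above, here is a minimal loading sketch using the `transformers` and `bitsandbytes` libraries. The Hugging Face repo id is an assumption, and bitsandbytes 4-/8-bit quantization only approximates the GGUF-style Q4/Q8 builds the reviewer likely tested; the video does not show its actual setup.

```python
# Hedged sketch: loading the same model at Q4-, Q8-, and FP16-like
# precision. The repo id below is an assumption; the video does not
# show the reviewer's exact loading code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "open-thoughts/OpenThinker-7B"  # assumed Hugging Face repo id

def load(precision: str):
    kwargs = {"device_map": "auto"}
    if precision == "q4":
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
    elif precision == "q8":
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
    else:  # fp16
        kwargs["torch_dtype"] = torch.float16
    return AutoModelForCausalLM.from_pretrained(MODEL_ID, **kwargs)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = load("q4")  # swap in "q8" or "fp16" to compare precisions
```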
Q & A
What is the OpenThinker model, and what parameters do the 32B and 7B models have?
-The OpenThinker model comes in two variants: a 32 billion parameter model and a 7 billion parameter model. They are fine-tuned versions of the Qwen 2.5 32B and 7B models, respectively.
What is the OpenThoughts dataset used for, and how was it generated?
-The OpenThoughts dataset is used to fine-tune the OpenThinker models. It was created by using DeepSeek R1 to generate questions and answers. The questions were selected from various benchmarks and then refined using a tool called Curator, which ensures the data is accurate and well-structured.
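To make the generate step concrete, here is a minimal sketch assuming an OpenAI-compatible endpoint serving DeepSeek R1; the endpoint URL and model name are placeholder assumptions, and the actual OpenThoughts pipeline is built on Curator, whose API is not reproduced here.

```python
# Minimal sketch of generating a reasoning sample with DeepSeek R1.
# The base_url and model name are placeholder assumptions; the real
# OpenThoughts pipeline uses the Curator tool instead.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def generate_sample(question: str) -> dict:
    """Ask the reasoning model for a chain of thought plus final answer."""
    reply = client.chat.completions.create(
        model="deepseek-r1",  # assumed model name on this endpoint
        messages=[{"role": "user", "content": question}],
    )
    return {"question": question, "response": reply.choices[0].message.content}
```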
How are math and code-related questions handled in the Open Thoughts dataset?
-For math and puzzle-related questions, the answers are verified by an LLM judge against known answers. For code questions, the code is executed and unit tested to ensure accuracy.
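As a rough illustration of these two verification paths, here is a hedged sketch; the function names, prompts, and test harness are illustrative assumptions, not the actual OpenThoughts code.

```python
# Hedged sketch of the two verification paths described above.
# All names and prompts are illustrative assumptions.
import os
import subprocess
import sys
import tempfile

def verify_code(candidate: str, unit_tests: str, timeout: int = 10) -> bool:
    """Execute the candidate solution together with its unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0  # non-zero exit means a test failed
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

def judge_math(question: str, answer: str, reference: str, client) -> bool:
    """Ask an LLM judge whether the answer matches the known reference."""
    prompt = (f"Question: {question}\nProposed answer: {answer}\n"
              f"Reference answer: {reference}\n"
              "Do these agree? Reply YES or NO.")
    reply = client.chat.completions.create(
        model="judge-model",  # assumed judge model name
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")
```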
How does the grading process for the OpenThinker models take time into account?
-The grading process considers the model's response time, applying penalties if the model takes longer than certain thresholds. For example, responses over 5 seconds result in a 1-point deduction, and responses over 2 minutes result in a 6-point deduction.
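In code, the penalty rule might look like the sketch below. Only the 5-second (1 point) and 2-minute (6 point) thresholds are stated explicitly in the video; any intermediate thresholds are not listed here, so this is a two-threshold approximation rather than the reviewer's exact rubric.

```python
# Hedged sketch of the time-penalty rule. Only two thresholds are
# stated in the Q&A (over 5 s: -1 point, over 2 min: -6 points);
# intermediate thresholds used in the video are not reproduced.
def time_penalty(seconds: float) -> int:
    if seconds > 120:   # over 2 minutes
        return 6
    if seconds > 5:     # over 5 seconds
        return 1
    return 0

def final_score(raw_score: int, seconds_to_answer: float) -> int:
    return max(0, raw_score - time_penalty(seconds_to_answer))

# Example: a 17/21 raw score with a 90-second thinking stage
# lands at 16 under this two-threshold approximation.
print(final_score(17, 90))  # -> 16
```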
What is the hardware setup used to run the OpenThinker models during testing?
-The machine used for testing has 8 H100 GPUs, each with 80GB of VRAM, and more than a terabyte of regular RAM. This powerful setup helps to minimize delays in the models' responses.
What was the performance of the OpenThinker 32B model on the task of writing a three-sentence story about a cat chasing a mouse without using the letter 'e'?
-The OpenThinker 32B model spent almost 2 minutes in its thinking stage but generated a simple, valid story that adhered to the constraint of not using the letter 'e'. The response itself would have earned 20 of 21 points, but a 3-point time penalty for the long thinking stage brought the final score to 17 out of 21.
How did the OpenThinker 32B model perform on the question about stacking playing cards?
-The model initially struggled with the question, taking over 4 minutes to think; after the context length was increased, the thinking time dropped to about 2.5 minutes. The final answer was comprehensive, scoring 18 out of 21 before time penalties reduced it to 15.
What is the significance of the OpenThinker 32B model's performance on the analogy question 'apple is to fruit as hammer is to tool'?
-The model performed well, offering several strong candidate analogies and settling on a good final answer: 'tree is to plant as car is to vehicle.' Despite a 95-second thinking period, the answer was highly relevant and met expectations, scoring 50 out of 60.
How did the OpenThinker model perform when asked to explain quantum entanglement to a 5-year-old?
-The model used a kitchen-based analogy involving cookies and a cookie jar to explain quantum entanglement. However, the explanation implied that the described phenomenon could actually happen, which was not accurate. It scored 63 out of 81 with some deductions for correctness and logical coherence.
What was the performance of the 7B model compared to the 32B model?
-The 7B model performed noticeably worse than the 32B model. Despite being fine-tuned on the same dataset, it produced less accurate or complete responses, and its overall performance was much weaker.