Gemini Flash Thinking 2.0, o3-mini-high, and DeepSeek-R1 Solve My Humanity's Last Exam Entry Problem

Kyle Kabasares

6 Feb 2025 10:58

Summary

TL;DR: In this video, the speaker tests three AI reasoning models (Gemini Advanced 2.0, o3-mini-high, and DeepSeek-R1) on a challenging physics problem involving electromagnetism. The task is to calculate the force on a charged particle near a non-uniformly charged sphere modeled after the Pokémon Electrode. Gemini and o3-mini-high solve the problem within seconds, while DeepSeek-R1 takes a slower, more thorough approach. All three models arrive at the correct answer, and DeepSeek-R1's extensive verification highlights the trade-off between speed and thoroughness. The video shows how well current AI models handle an undergraduate-level physics problem.

Takeaways

😀 The individual is testing three AI models on a challenging physics problem related to electromagnetism.
😀 The problem involves calculating the force experienced by a charged particle near a non-uniformly charged sphere (Pokémon Electrode).
😀 The problem is pitched at the level of an undergraduate physics student studying electromagnetism (Griffiths' Electrodynamics).
😀 Three AI models are tested: o3-mini-high, Gemini Advanced 2.0 (Flash Thinking Experimental), and DeepSeek-R1.
😀 The goal is to calculate the force in teranewtons; the correct answer is 3 teranewtons.
😀 Gemini Advanced 2.0 solves the problem in a matter of seconds and gives the correct answer of 3 teranewtons.
😀 o3-mini-high solves the problem in around 26 seconds, also giving the correct answer of 3 teranewtons.
😀 DeepSeek-R1 takes significantly longer (nearly 10 minutes), using a chain-of-thought process to check that its answer is correct.
😀 DeepSeek-R1 eventually arrives at an answer of 3.01 teranewtons after thorough verification, highlighting its persistent reasoning process.
😀 All three models ultimately arrive at correct answers, but the speed and verification process of each model differ significantly.
😀 The individual concludes that the problem might have been too easy for Humanity's Last Exam, based on how quickly the AI models solved it.
😀 DeepSeek-R1's lengthy processing time demonstrates the model's thoroughness, even though that much checking may have been unnecessary for a problem of this difficulty.

Q & A

What is the purpose of Humanity's Last Exam in AI testing?
-Humanity's Last Exam is designed as a benchmark to test the capabilities of AI models. It involves difficult questions created by experts to assess AI's ability to solve complex problems, particularly those that would challenge the intelligence of emerging AI systems.
Why didn't the presenter's problem get accepted for Humanity's Last Exam?
-The presenter's problem, which was based on electromagnetism, was not picked for Humanity's Last Exam, possibly because it was considered too easy for the AI models being tested, given how quickly they were able to solve it.
What physics topic does the presented problem focus on?
-The problem focuses on electromagnetism, specifically charge densities, electric fields, and Gauss's law, topics commonly studied in an undergraduate electrodynamics course.
What is the specific problem presented in the video?
-The problem involves calculating the force experienced by a charged point particle near a non-uniformly charged sphere, modeled after the Pokémon Electrode. The force is determined using charge density and Maxwell's equations.
How does the charge density of Electrode relate to its maximum charge output?
-The central charge density of Electrode is approximated by dividing its maximum charge output (300 kS) by the total volume of the sphere, providing a simplified expression for the charge density in the problem.
What are the results obtained by the AI models in solving the problem?
-All three AI models (Gemini Advanced 2.0, o3-mini-high, and DeepSeek-R1) correctly calculated the force to be approximately 3 teranewtons. Gemini and o3-mini-high reached the answer quickly, while DeepSeek-R1 took much longer to arrive at the same conclusion.
What did the presenter notice about the performance of Gemini Advanced 2.0?
-Gemini Advanced 2.0 was able to solve the problem almost instantaneously, completing it in just a few seconds, demonstrating a very high level of computational efficiency.
How did o3-mini-high perform compared to other models?
-o3-mini-high performed relatively quickly, solving the problem in approximately 26 seconds, which was slower than Gemini but still considerably faster than DeepSeek-R1.
What was distinctive about DeepSeek-R1's approach to solving the problem?
-DeepSeek-R1 took a much slower, more cautious approach, breaking the problem into smaller parts and checking its calculations multiple times. It took about 10 minutes to arrive at the solution.
How does the performance time of DeepSeek-R1 compare to the other models?
-DeepSeek-R1 took significantly longer (approximately 10 minutes) to solve the problem than Gemini (a few seconds) and o3-mini-high (26 seconds), due to its more exhaustive and iterative approach.
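
The Gauss's law approach mentioned in the answers above can be sketched in a few lines of Python. This is only an illustration, not the presenter's actual problem or solution: it assumes the charge density depends only on the distance r from the sphere's center, and the function names, the density profile, and every numeric value in the example are hypothetical placeholders, since the summary does not give the real ones.

```python
import math

EPSILON_0 = 8.8541878128e-12  # vacuum permittivity in F/m


def enclosed_charge(rho, r_max, steps=10_000):
    """Integrate rho(r) * 4*pi*r^2 from 0 to r_max using the midpoint rule."""
    dr = r_max / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * dr  # midpoint of the i-th radial shell
        total += rho(r) * 4.0 * math.pi * r * r * dr
    return total


def force_on_point_charge(rho, sphere_radius, distance, q):
    """Radial force in newtons on a point charge q at `distance` from the center.

    Assumes a spherically symmetric density rho(r), so by Gauss's law only
    the charge inside radius `distance` contributes to the field there.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    r_enc = min(distance, sphere_radius)
    q_enc = enclosed_charge(rho, r_enc)
    return q * q_enc / (4.0 * math.pi * EPSILON_0 * distance ** 2)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the functions.
    R = 0.6                      # sphere radius in m (assumed)
    Q_MAX = 1.0e-3               # total-charge scale in C (assumed)
    rho_0 = Q_MAX / ((4.0 / 3.0) * math.pi * R ** 3)  # charge over volume

    def profile(r):
        # Assumed non-uniform profile: densest at the center, zero at the surface.
        return rho_0 * (1.0 - r / R)

    print(force_on_point_charge(profile, R, distance=1.0, q=1.0e-6))
```

The central density in the example follows the same idea as the Q & A answer on charge density (a charge divided by the sphere's volume); numerical integration is used here so that any radial profile can be swapped in without redoing the integral by hand.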