DeepSeek o1 o3-mini coding test experiment results
Summary
TL;DR: In this detailed analysis, the speaker evaluates the performance of three AI models—DeepSeek-R1, ChatGPT o1, and o3-mini—on coding challenges. Through tests involving Zig, Go, and SQLite, the speaker finds that while DeepSeek and o3-mini show promise, they often fail to generate correct solutions, especially on complex tasks. ChatGPT o1 is more accurate at coding but still struggles with error handling and reasoning. Despite their flaws, DeepSeek offers a cheaper, faster alternative for simpler tasks. The video highlights the current limitations of AI in programming: useful as these models are, they still lack the reasoning abilities of human developers.
Takeaways
- 😀 DeepSeek-R1 and o3-mini are recently released AI models, and the speaker has run coding experiments with both.
- 😀 The speaker has a history of evaluating AI models' coding abilities and has tested both DeepSeek and ChatGPT o3-mini on code generation.
- 😀 Both DeepSeek and o3-mini struggle with more complex coding tasks, such as writing non-trivial programs in less common languages like Zig.
- 😀 In common languages like Go, o1 (ChatGPT's full version) performs better than DeepSeek, though it still struggles with error handling and proper variable management.
- 😀 o3-mini appears to be slower and more verbose compared to the main o1 model, showing no significant improvement.
- 😀 Despite DeepSeek's code being better written in some areas (like error handling), both DeepSeek and o3-mini fail to generate correct outputs for some tasks.
- 😀 The CodeCrafters site provides structured challenges for testing AI models, allowing a comparison of how well AIs solve real-world problems versus idealized scenarios.
- 😀 AI models, including o1 and DeepSeek, tend to get stuck on tasks involving error handling and proactive debugging, often requiring multiple attempts to fix issues.
- 😀 Despite DeepSeek's faster performance and lower cost, o1 is generally more successful at solving coding problems correctly, particularly in DNS and Go-related challenges.
- 😀 The speaker emphasizes that AIs like DeepSeek and o3-mini are still far from replacing human programmers for more complex tasks and that the quality of their code often falls short of expectations.
- 😀 Although DeepSeek might be preferred for simple coding tasks due to its speed and cost, o1 remains the more reliable choice for solving coding challenges with accuracy.
Q & A
What is the focus of the video?
-The video compares the code-generation abilities of three AI models—DeepSeek R1, ChatGPT o1, and ChatGPT o3-mini—through coding challenges involving Zig, Go, and SQLite.
How does DeepSeek compare to ChatGPT o1 in code generation tasks?
-DeepSeek performs similarly to ChatGPT o1, but both models fail to generate correct code for more complex tasks. While o1 generally performs better in coding challenges, DeepSeek is faster, cheaper, and an open model.
What is the main issue with both DeepSeek and ChatGPT o3-mini during code generation?
-Both DeepSeek and ChatGPT o3-mini struggle with tasks requiring reasoning or deeper understanding of the problem. They often fail to produce code that compiles or produces the correct answers, especially in non-trivial programming challenges.
What programming languages were used in the coding challenges?
-The challenges involved Zig, Go, and SQLite: Zig and Go were the implementation languages, while SQLite was chosen for the task involving database manipulation.
How did the AI models perform in generating Zig code?
-All three AI models failed to generate a Zig program that would even compile. Despite multiple attempts, none of the models succeeded in producing functional code in Zig for a simple task.
Why was the Go programming language chosen for further testing?
-Go was chosen for further testing because it is a well-known language, ranking highly on the TIOBE index, and because the challenge involved basic code generation rather than a specific SQL implementation, making it a better test of the AI models' capabilities.
What error handling issue did ChatGPT o1 encounter during the Go coding challenge?
-ChatGPT o1 removed key error-handling code, such as 'log.Fatal()' calls, which caused problems when the code hit unexpected behavior: the AI returned incorrect answers (e.g., zero instead of four) without indicating anything was wrong.
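This failure mode is easy to reproduce in Go, where a discarded error leaves the target variable at its zero value. The sketch below is illustrative only (the file name 'sample.db' and the field being read are hypothetical, not the speaker's actual code); it shows why stripping the 'log.Fatal()' checks turns a read failure into a silent wrong answer of zero:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("sample.db") // hypothetical input file
	if err != nil {
		log.Fatal(err) // the kind of check that was reportedly removed
	}
	defer f.Close()

	var count uint16
	// If this error were discarded instead of checked, a failed read
	// would leave count at its zero value and the program would print 0
	// while appearing to succeed.
	if err := binary.Read(f, binary.BigEndian, &count); err != nil {
		log.Fatal(err)
	}
	fmt.Println(count)
}
```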
How did DeepSeek's approach to error handling compare to ChatGPT o1?
-DeepSeek's code was generally better written in terms of error checking, though it still missed critical steps, such as not seeking past the 100-byte SQLite file header before reading the required variable. It did not, however, exhibit the same error-handling problems as o1.
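For context, an SQLite database file begins with a fixed 100-byte header, and on page 1 the b-tree page header starts only after it. Below is a minimal sketch of the seek DeepSeek reportedly skipped; it assumes a CodeCrafters-style table-counting stage and a hypothetical 'sample.db', and is not the model's actual output:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("sample.db") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Page 1 begins with the 100-byte SQLite database header;
	// the b-tree page header starts only after it.
	if _, err := f.Seek(100, io.SeekStart); err != nil {
		log.Fatal(err)
	}

	pageHeader := make([]byte, 8) // a leaf table b-tree page header is 8 bytes
	if _, err := io.ReadFull(f, pageHeader); err != nil {
		log.Fatal(err)
	}

	// The cell count is a big-endian uint16 at offset 3 of the page header;
	// on page 1 of a simple database it equals the number of tables.
	fmt.Println("number of tables:", binary.BigEndian.Uint16(pageHeader[3:5]))
}
```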
What was the outcome of DeepSeek and o1 during the DNS server coding challenge?
-During the DNS server challenge, o1 performed much better than DeepSeek, successfully completing the first four stages while DeepSeek struggled on the second stage and made irrelevant changes that did not address the actual problem.
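To give a sense of what the early DNS stages demand, the protocol's fixed header is twelve bytes of big-endian fields (RFC 1035). The following is a hedged sketch of encoding one in Go; the field values are illustrative and do not reproduce the challenge's exact expected bytes:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// dnsHeader models the fixed 12-byte DNS message header (RFC 1035).
type dnsHeader struct {
	ID      uint16 // query identifier, echoed back in the response
	Flags   uint16 // QR, opcode, AA, TC, RD, RA, rcode packed into 16 bits
	QDCount uint16 // number of questions
	ANCount uint16 // number of answer records
	NSCount uint16 // number of authority records
	ARCount uint16 // number of additional records
}

// encode serializes the header in network (big-endian) byte order.
func (h dnsHeader) encode() []byte {
	buf := make([]byte, 12)
	binary.BigEndian.PutUint16(buf[0:2], h.ID)
	binary.BigEndian.PutUint16(buf[2:4], h.Flags)
	binary.BigEndian.PutUint16(buf[4:6], h.QDCount)
	binary.BigEndian.PutUint16(buf[6:8], h.ANCount)
	binary.BigEndian.PutUint16(buf[8:10], h.NSCount)
	binary.BigEndian.PutUint16(buf[10:12], h.ARCount)
	return buf
}

func main() {
	// Illustrative values: a response (QR bit set) to query ID 1234.
	h := dnsHeader{ID: 1234, Flags: 0x8000, QDCount: 1}
	fmt.Printf("% x\n", h.encode())
}
```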
What conclusion did the video come to about the performance of these AI models?
-The video concludes that while DeepSeek is cheaper, faster, and an open model, ChatGPT o1 performs better overall in coding challenges, especially when it comes to producing correct answers. However, both models still face significant limitations in complex problem-solving tasks.