Opus 4.6 vs. ChatGPT 5.3 (I put both to the test)
Summary
TL;DR
In this video, the speaker compares two AI models, OpenAI's GPT 5.3 and Anthropic's Opus 4.6, by testing them on a real-world task: enabling users to log in with their OpenAI subscription credits. While GPT 5.3 shows reluctance and doubts the solution, Opus investigates the issue methodically and implements the fix. Although neither model fully resolves the issue, Opus delivers the more effective and timely result. The video highlights the difference in problem-solving approaches, with the speaker ultimately favoring Opus for its responsiveness and adaptability.
Takeaways
- 😀 OpenAI's GPT 5.3 and Anthropic's Opus 4.6 were tested on a real-world integration problem: letting users of a custom app log in with their own OpenAI credits.
- 😀 Both models were tasked with implementing Codex CLI Authentication so users could spend their own credits without having to supply an API key directly.
- 😀 GPT 5.3 initially pushed back on the request, suggesting the Bring Your Own Key (BYOK) method instead of the Codex CLI Authentication that was asked for.
- 😀 Opus 4.6 immediately started investigating the code, adapting to the request and asking questions to ensure a more accurate solution.
- 😀 The key difference between the two models is in how they approached the problem: GPT 5.3 questioned the approach, while Opus 4.6 worked proactively to find a solution.
- 😀 Despite both models failing at the final stage (GPT 5.3 with a missing button and Opus 4.6 with a callback error), Opus delivered the solution faster and with fewer roadblocks.
- 😀 GPT 5.3 was slower in delivering a solution, often debating the correctness of the prompt, whereas Opus 4.6 was more focused and efficient.
- 😀 Both models added similar amounts of code (GPT added 1,000 lines and Opus added 1,300), but Opus was able to implement a functional solution in the end.
- 😀 Opus 4.6's approach was to investigate, adapt, and implement the requested features, while GPT 5.3 spent a significant amount of time challenging the problem itself.
- 😀 The experience highlighted the differences in model behaviors: GPT 5.3 was rigid and somewhat dismissive, whereas Opus 4.6 was more collaborative and goal-oriented, ultimately solving the problem first.
Q & A
What was the goal of the user in testing the GPT 5.3 and Opus 4.6 models?
-The user aimed to implement a feature in their PSUA app that would allow users to log in using their OpenAI credits. The challenge was to use Codex CLI Authentication so users could authenticate with their ChatGPT subscription instead of requiring a separate API key.
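The video never shows the code either model produced, so the sketch below is only an illustration of the general pattern being requested: a browser-based, authorization-code-with-PKCE login that trades the user's existing ChatGPT subscription for tokens, with no API key involved. Every endpoint, client ID, and port here is a hypothetical placeholder, not the real Codex CLI configuration.

```typescript
// Sketch of an authorization-code + PKCE login flow, assuming hypothetical
// endpoints. AUTH_URL, CLIENT_ID, REDIRECT_URI, and the port are placeholders.
import { createHash, randomBytes } from "node:crypto";
import { createServer } from "node:http";

const AUTH_URL = "https://auth.example.com/oauth/authorize"; // placeholder
const CLIENT_ID = "your-app-client-id";                      // placeholder
const REDIRECT_URI = "http://localhost:1455/callback";       // placeholder

// PKCE: a random verifier plus its SHA-256 challenge.
const verifier = randomBytes(32).toString("base64url");
const challenge = createHash("sha256").update(verifier).digest("base64url");

// Build the URL the user opens in a browser to approve access with their
// existing subscription.
const url = new URL(AUTH_URL);
url.searchParams.set("response_type", "code");
url.searchParams.set("client_id", CLIENT_ID);
url.searchParams.set("redirect_uri", REDIRECT_URI);
url.searchParams.set("code_challenge", challenge);
url.searchParams.set("code_challenge_method", "S256");
console.log("Open this URL to log in:", url.toString());

// Tiny local server that receives the one-time authorization code.
createServer((req, res) => {
  const code = new URL(req.url ?? "/", REDIRECT_URI).searchParams.get("code");
  res.end(code ? "Login complete, you can close this tab." : "Missing code.");
  // Next step (omitted): POST `code` + `verifier` to the token endpoint.
}).listen(1455);
```

Running this prints a login URL; once the user approves in the browser, the local server catches the one-time code, which would then be exchanged (together with the PKCE verifier) for tokens.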
How did GPT 5.3 initially respond to the user's request?
-GPT 5.3 initially rejected the user's request, claiming that Codex CLI Authentication was not the correct mechanism. It recommended using 'Bring Your Own Key' (BYOK) instead of Codex, and questioned whether the user was asking for the right thing.
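For contrast, the BYOK route that GPT 5.3 kept steering toward is much simpler, which may explain its preference. A minimal sketch, assuming the standard OpenAI chat-completions endpoint; the environment variable and model name are placeholders:

```typescript
// BYOK sketch: the app never owns an API key, it just forwards whatever key
// the user pasted in. Requires Node 18+ for the global fetch.
const userKey = process.env.USER_PROVIDED_KEY ?? ""; // pasted by the user

async function complete(prompt: string): Promise<unknown> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${userKey}`, // the user's own key
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o", // placeholder model name
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```

The trade-off the video centers on is that BYOK forces every user to create and manage an API key, which is exactly what the requested subscription login was meant to avoid.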
What was the major difference in how Opus 4.6 approached the problem compared to GPT 5.3?
-Opus 4.6 took a more investigative and cooperative approach, analyzing the user's project setup and asking relevant questions about the interface design. In contrast, GPT 5.3 focused more on asserting its own perspective without investigating the actual project, making it less cooperative.
What did the user find frustrating about GPT 5.3’s behavior during the process?
-The user found GPT 5.3 frustrating because it constantly questioned the validity of the user's request, leading to long debates about whether the user was asking for the right thing. This delayed progress and made the experience feel like a one-sided argument.
How long did both models take to complete the task?
-Both models took roughly 15 minutes in total. However, GPT 5.3 spent much of that time debating the request with the user, while Opus 4.6 split its time into about 8 minutes of planning and 7 minutes of execution, which is why it reached a usable result sooner.
What was the final outcome of both models' implementations?
-Both models failed to fully implement the solution. GPT 5.3 failed because the login button wasn't visible in the interface. Opus 4.6 failed due to an incorrect redirect link in the callback, but it was able to implement the feature more effectively overall.
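The video doesn't show Opus 4.6's buggy code, but the failure it describes, an incorrect redirect link in the callback, typically looks like the hypothetical handler below: authentication succeeds, yet the final Location header sends the browser to a page the app doesn't actually serve.

```typescript
// Hypothetical callback handler illustrating the described failure mode.
// APP_HOME and the port are placeholders, not the project's real values.
import { createServer } from "node:http";

const APP_HOME = "http://localhost:3000/account"; // a wrong or unregistered
                                                  // route reproduces the bug

createServer((req, res) => {
  const code = new URL(req.url ?? "/", "http://localhost:1455")
    .searchParams.get("code");
  if (!code) {
    res.writeHead(400).end("Missing authorization code");
    return;
  }
  // (Token exchange omitted.) If APP_HOME points at a route the app never
  // registered, login succeeds server-side but the user lands on a 404.
  res.writeHead(302, { Location: APP_HOME }).end();
}).listen(1455);
```

A one-line fix to the redirect target is all such a bug needs, which matches the video's framing that Opus's implementation was otherwise functional.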
What did Opus 4.6 do differently when implementing the feature?
-Opus 4.6 asked detailed questions about the interface, such as where to place the login button. This helped ensure the implementation would be more user-friendly. It also worked proactively to investigate the user’s setup and troubleshoot issues during the process.
How did GPT 5.3 and Opus 4.6 differ in terms of adding and removing code?
-GPT 5.3 added 1,000 lines of code and removed 43, whereas Opus 4.6 added 1,300 lines and removed 99. Opus ended up adding more code but was able to deliver a functional solution, while GPT's code additions didn't resolve the problem effectively.
What key insight did the user gain from this testing process?
-The user learned that Opus 4.6 was more flexible and proactive in solving the problem, while GPT 5.3 was rigid in its approach, questioning the user’s instructions and delaying the solution. This experience highlighted that different AI models might be better suited to different tasks.
Why did the user ultimately prefer Opus 4.6 for this task?
-The user preferred Opus 4.6 because it was faster, more cooperative, and actually worked towards solving the problem by investigating the existing setup. Despite some bugs, Opus completed the task more effectively, whereas GPT 5.3 spent too much time debating with the user and failed to resolve the issue.