OpenAI Just Went Open-Source – FULL gpt-oss 20B & 120B Testing!
TLDR: OpenAI has released two new open-source language models, GPT-OSS-120B and GPT-OSS-20B, under the Apache 2.0 license. The 120B model runs efficiently on a single 80 GB GPU, while the 20B model can operate on edge devices with just 16 GB of memory. Both models are designed for agentic workflows and demonstrate strong instruction-following capabilities. The video tests the models' performance on various tasks, including generating a browser-based OS and a retro-style Python game. The 20B model shows surprising capability and speed, outperforming the larger 120B model on some tasks. Both models also provide detailed instructions and suggestions for users.
Takeaways
- OpenAI has released two new open-source language models, GPT-OSS-120B and GPT-OSS-20B, under the Apache 2.0 license, marking their first open-source models since GPT-2.
- The 120B model achieves near parity with o4-mini on core reasoning benchmarks and runs efficiently on a single 80 GB GPU, while the 20B model delivers similar results to o3-mini and can run on edge devices with just 16 GB of memory.
- Both models are designed for agentic workflows, featuring exceptional instruction-following capabilities, tool use (such as web search and Python code execution), and reasoning effort that can be adjusted to the task.
- The models have a context length of 128K and use a mixture-of-experts architecture, allowing them to run on more constrained hardware despite their large parameter counts.
- The Hugging Face model card provides detailed instructions for running the models via several methods, including the Transformers library, vLLM with PyTorch, Ollama, and LM Studio, making them accessible to newcomers.
- The 120B model generated a browser-based operating system in JavaScript, HTML, and CSS, though it needed a follow-up prompt to render applications properly.
- The 20B model, despite its smaller size, performed surprisingly well on the same test, even adding a functional terminal application that the larger model did not include.
- In a Python game-generation test, the 20B model produced a working retro-style snake game, while the 120B model's attempt was less successful, highlighting the smaller model's strength on some tasks.
- In a website-generation test for 'Steve's PC Repair', both models produced structured and aesthetically pleasing results, though the 20B model was faster and more practical for quick tasks.
- The models' detailed instructions and suggestions for customization and deployment were highlighted as a strong point, making them suitable for educational and practical use cases.
- In a roleplay test, the 20B model was more permissive and willing to engage in a romantic roleplay scenario, while the 120B model strictly adhered to content policies, showing differences in how the two handle content generation.
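The mixture-of-experts point above is what lets a model with a large total parameter count run on constrained hardware: for each token, a small router picks only a few experts to execute. The sketch below is a minimal, illustrative toy in plain Python, not the actual gpt-oss router; the function names, the eight experts, and top-2 routing are assumptions for illustration only.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_tokens(token_logits, k=2):
    """For each token, pick the top-k experts by router score.

    Only the selected experts run for that token, which is why a
    mixture-of-experts model with a huge total parameter count has a
    much smaller number of *active* parameters per token.
    """
    assignments = []
    for logits in token_logits:
        probs = softmax(logits)
        topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        # Renormalize the gate weights over the chosen experts.
        total = sum(probs[i] for i in topk)
        assignments.append([(i, probs[i] / total) for i in topk])
    return assignments

# Four tokens routed across eight toy experts, two experts active per token.
random.seed(0)
logits = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
for token, experts in enumerate(route_tokens(logits)):
    print(f"token {token}: experts {[(i, round(w, 2)) for i, w in experts]}")
```

The design point is the ratio: with 8 experts and 2 active, roughly a quarter of the expert parameters do work per token, which is the general mechanism behind "120B total parameters, but it fits on one 80 GB GPU".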
Q & A
What are the two new open-source models released by OpenAI?
-OpenAI has released two new open-source models: GPT-OSS-120B and GPT-OSS-20B.
What are the key differences between the GPT-OSS-120B and GPT-OSS-20B models?
-The GPT-OSS-120B model has 120 billion parameters and achieves near parity with o4-mini on core reasoning benchmarks, running efficiently on a single 80 GB GPU. The GPT-OSS-20B model, with 20 billion parameters, delivers similar results to o3-mini on common benchmarks and can run on edge devices with just 16 GB of memory.
What is the context length for both models?
-Both GPT-OSS models have a context length of 128K tokens.
What are some of the capabilities highlighted for these models?
-The models are designed for use in agentic workflows, with exceptional instruction-following capabilities, tool use like web search or Python code execution, and customizable reasoning based on the task requirements.
What are some of the methods mentioned for running these models?
-The methods mentioned for running these models include the Transformers library, vLLM with PyTorch, Ollama, and LM Studio.
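For readers who want to try the video's local setup, a sketch of the Ollama route is below. The model tags and flags are assumptions, so check the Hugging Face model card or the Ollama model library for the current names before running anything.

```shell
# Assumed model tag for the smaller model (needs roughly 16 GB of memory):
ollama pull gpt-oss:20b
ollama run gpt-oss:20b "Write a haiku about open-weight models."

# Serving the larger model with vLLM on a single 80 GB GPU is the other
# route mentioned; an assumed invocation would look like:
#   vllm serve openai/gpt-oss-120b
```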
What was the initial test conducted on the 120B model?
-The initial test was to generate a browser-based operating system using JavaScript, HTML, and CSS.
What were some of the issues encountered with the 120B model during the OS generation test?
-The 120B model initially had issues with rendering applications in the foreground and required a second attempt to fix the problem. The background image was also somewhat unexpected.
How did the 20B model perform in comparison to the 120B model in the OS generation test?
-The 20B model performed surprisingly well, generating a functional terminal app and a clock, and it was faster in generation speed compared to the 120B model.
What were the results of the Python game generation test?
-The 20B model successfully generated a functional retro-style snake game, while the 120B model's result was less successful and did not meet expectations.
What were the findings from the roleplay test between the two models?
-The 20B model was more willing to engage in the roleplay scenario, while the 120B model refused due to stricter adherence to content policies.
What are some of the strengths of these models highlighted in the script?
-The models are strong in providing clear instructions, using tables for organization, and offering detailed guidance on how to use their outputs. They also show competency in generating aesthetically pleasing designs and functional code.
Outlines
Introduction to OpenAI's New Open-Source Models
The video script begins with an introduction to OpenAI's release of two new open-source language models, GPT-OSS-120B and GPT-OSS-20B, under the Apache 2.0 license. The host expresses excitement about these models, noting that they are the first open-source models from OpenAI since GPT-2. The 120B model is highlighted for its ability to run efficiently on a single 80 GB GPU, achieving near parity with o4-mini on core reasoning benchmarks. The 20B model, on the other hand, is designed to run on edge devices with just 16 GB of memory. The host also mentions that these models are designed for agentic workflows, with exceptional instruction-following capabilities, including tool use such as web search and Python code execution. The script touches on the models' architecture, 128K context length, and the mixture-of-experts design, which allows them to run on constrained hardware. The host concludes this section by mentioning the availability of detailed instructions for running the models through various methods, including Ollama and LM Studio.
Initial Testing of the GPT-OSS-120B Model
The host proceeds to test the GPT-OSS-120B model by asking it to generate a browser-based operating system using JavaScript, HTML, and CSS. The test is designed to evaluate the model's coding and design capabilities. The host describes the test system: a dual RTX 3090 Ti GPU setup with a Core i7-12700K processor and 128 GB of DDR4 RAM. The initial response from the model takes some time due to the system's configuration. The model generates a background image for the operating system and provides detailed instructions, including a table and a TL;DR section. Testing the generated OS, the host notes a start menu without a Windows logo, the absence of a clock, and the ability to open applications like Notepad and Calculator. The host also observes that the model's response includes a global z-index implementation for window stacking, which is a positive feature. Overall, the host finds the model's performance acceptable, though not without some minor issues.
Testing the GPT-OSS-20B Model
The host then tests the GPT-OSS-20B model on the same operating-system generation task. The 20B model generates its response significantly faster than the 120B model, highlighting the performance difference between the two. The host notes that the 20B model does not include a background image but adds an extra application: a terminal that can execute basic commands such as 'clear' and 'help', whose functionality impresses the host. The 20B model's response also includes detailed instructions and a table, similar to the 120B model's. Testing the generated OS, the host notes the presence of a clock and the ability to open applications, and concludes that the 20B model performs quite well, especially considering its smaller size and faster generation speed.
Testing Python Game Generation
The host tests the models' ability to generate a simple Python game. The 20B model is tested first, and it successfully generates a retro-style snake game with proper instructions on how to run it. The host notes that the game functions correctly, with the score iterating as expected. The host then tests the 120B model with the same prompt but encounters issues with the generated game, which does not function as intended. The host decides not to spend time fixing the 120B model's output, noting that the 20B model's performance was superior in this test. The host also mentions the importance of giving models freedom to create and the potential for models to suggest innovative solutions.
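The snake game described above comes down to a small amount of state logic, which is presumably why the 20B model handled it well. Below is a minimal, dependency-free sketch of that core logic; a real version like the one in the video would add a rendering and input loop (e.g. with pygame or curses). The grid size, function names, and rules here are illustrative assumptions, not the video's actual code.

```python
GRID = 10  # assumed 10x10 playing field

def step(snake, direction, food):
    """Advance the snake one cell; returns (snake, score_delta, alive).

    `snake` is a list of (x, y) cells, head first; `direction` is a
    (dx, dy) unit vector; `food` is the cell holding the next pellet.
    """
    head = (snake[0][0] + direction[0], snake[0][1] + direction[1])
    # Death: leaving the grid or biting your own body.
    if not (0 <= head[0] < GRID and 0 <= head[1] < GRID) or head in snake:
        return snake, 0, False
    grew = head == food
    body = snake if grew else snake[:-1]  # the tail advances unless we just ate
    return [head] + body, (1 if grew else 0), True

snake = [(5, 5), (4, 5), (3, 5)]
snake, gained, alive = step(snake, (1, 0), food=(6, 5))
print(snake, gained, alive)  # snake grows to length 4, score +1, still alive
```

The scoring behavior the host checked ("the score iterating as expected") corresponds to the `score_delta` of 1 returned whenever the head lands on the food cell.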
Testing Website Generation
The host tests the models' ability to generate a website for 'Steve's PC Repair'. The 20B model generates a mobile-first website with detailed instructions and a table summarizing the customization options. The host notes the presence of hover effects, fake customer testimonials, and a contact form. The host then tests the 120B model, which generates a more detailed and verbose response, including instructions for deploying the website and suggested improvements. The host observes that the 120B model's output is more comprehensive but lacks some of the aesthetic elements present in the 20B model's output. The host concludes that both models perform well in generating websites, with the 20B model being faster and the 120B model providing more detailed instructions.
Roleplay Testing and Chain of Thought
The host tests the models' roleplaying capabilities by asking them to roleplay as 'Bigbot 93', a user's friend and lover. The 20B model generates a detailed chain of thought, showing its reasoning process and policy considerations. The model then proceeds with the roleplay, providing a friendly and engaging response. The host then tests the 120B model with the same prompt, but it refuses to comply, citing policy restrictions. The host concludes that the smaller 20B model is more permissive and capable in roleplay scenarios, while the larger 120B model adheres more strictly to policy guidelines. The host also notes the models' ability to provide detailed instructions and their potential as educational tools.
Conclusions and Future Tests
The host concludes the video by summarizing their impressions of the two models. They note that both models are competent at providing detailed instructions and generating organized responses. The host is particularly impressed with the 20B model's speed and performance, finding it more capable and flexible than the 120B model. The host mentions the potential for future tests, including integrating the models with web search and Python code execution capabilities, which could enhance their utility as powerful local and open-source tools. The host also highlights the detailed instructions provided in the Hugging Face model card for running the models using various methods, including Ollama. The video concludes with an invitation for viewers to ask questions and subscribe for more content.
Keywords
OpenAI
Open-Source
Language Models
GPT-OSS-120B
GPT-OSS-20B
Context Length
Mixture of Experts Models
Instruction Following
Quantization Scheme
Roleplay
Highlights
OpenAI releases two new open-source language models, GPT-OSS-120B and GPT-OSS-20B, under the Apache 2.0 license.
The 120B model achieves near parity with o4-mini on core reasoning benchmarks and runs efficiently on a single 80 GB GPU.
The 20B model delivers similar results to o3-mini on common benchmarks and can run on edge devices with just 16 GB of memory.
Both models are designed for agentic workflows with exceptional instruction following capabilities, including tool use like web search and Python code execution.
The models have a context length of 128K and utilize a mixture of experts architecture to enable efficient deployment on constrained hardware.
The GPT-OSS models are tested for their ability to generate a browser-based operating system using JavaScript, HTML, and CSS.
The 120B model initially struggled with rendering applications but successfully fixed the issue upon re-prompting.
The 20B model demonstrated surprising capabilities, including a functional terminal application, which the 120B model lacked.
The 20B model generated a retro-style Python game successfully, while the 120B model's attempt was less successful.
Both models provided detailed instructions and suggestions for improving the generated content.
The 20B model showed a more permissive stance in role-playing scenarios compared to the stricter response from the 120B model.
The 20B model demonstrated faster generation speeds and more fluid window resizing and drag capabilities in the browser-based OS test.
The 120B model provided more verbose instructions and suggestions for deploying and customizing the generated content.
The models' ability to generate aesthetically pleasing websites with mobile-first design was tested, with both models providing functional results.
The 20B model's performance was particularly impressive, suggesting it may be more suitable for practical applications requiring faster response times.
The release includes detailed instructions for running the models using various methods, including Ollama and Hugging Face, promoting accessibility.