OpenAI Just Went Open-Source – FULL gpt-oss 20B & 120B Testing!
TLDR: OpenAI has released two new open-source language models, GPT-OSS-120B and GPT-OSS-20B, under the Apache 2.0 license. The 120B model runs efficiently on a single 80 GB GPU, while the 20B model can operate on edge devices with just 16 GB of memory. Both models are designed for agentic workflows and demonstrate strong instruction-following capabilities. The video tests the models' performance on various tasks, including generating a browser-based OS and a retro-style Python game. The 20B model shows surprising capability and speed, outperforming the larger 120B model on some tasks. Both models also provide detailed instructions and suggestions for users.
Takeaways
- OpenAI has released two new open-source language models, GPT-OSS-120B and GPT-OSS-20B, under the Apache 2.0 license, marking their first open-source models since GPT-2.
- The 120B model achieves near parity with o4-mini on core reasoning benchmarks and runs efficiently on a single 80 GB GPU, while the 20B model delivers similar results to o3-mini and can run on edge devices with just 16 GB of memory.
- Both models are designed for agentic workflows, featuring exceptional instruction-following capabilities, tool use (such as web search and Python code execution), and reasoning effort that can be adjusted to the task.
- The models have a context length of 128K and use a mixture-of-experts architecture, allowing them to run on more constrained hardware despite their large parameter counts.
- The Hugging Face model card provides detailed instructions for running the models via several methods, including the Transformers library, vLLM with PyTorch, Ollama, and LM Studio, making them accessible to newcomers.
- The 120B model generated a browser-based operating system in JavaScript, HTML, and CSS, though it needed a follow-up prompt to render applications properly.
- The 20B model, despite its smaller size, performed surprisingly well on the same test, even adding a functional terminal application that the larger model did not include.
- In a Python game-generation test, the 20B model produced a working retro-style snake game, while the 120B model's attempt was less successful, highlighting the smaller model's strength on some tasks.
- In a website-generation test for 'Steve's PC Repair', both models produced structured and aesthetically pleasing results, though the 20B model was faster and more practical for quick tasks.
- The models' detailed instructions and suggestions for customization and deployment were highlighted as a strong point, making them suitable for educational and practical use cases.
- In a roleplay test, the 20B model was more permissive and willing to engage in a romantic roleplay scenario, while the 120B model strictly adhered to content policies, showing differences in how the two handle content generation.
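The mixture-of-experts point above is what lets a model with a large total parameter count run on constrained hardware: for each token, a small router picks only a few experts to execute. The sketch below is a minimal, illustrative toy in plain Python, not the actual gpt-oss router; the function names, the eight experts, and top-2 routing are assumptions for illustration only.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of router logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_tokens(token_logits, k=2):
    """For each token, pick the top-k experts by router score.

    Only the selected experts run for that token, which is why a
    mixture-of-experts model with a huge total parameter count has a
    much smaller number of *active* parameters per token.
    """
    assignments = []
    for logits in token_logits:
        probs = softmax(logits)
        topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        # Renormalize the gate weights over the chosen experts.
        total = sum(probs[i] for i in topk)
        assignments.append([(i, probs[i] / total) for i in topk])
    return assignments

# Four tokens routed across eight toy experts, two experts active per token.
random.seed(0)
logits = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
for token, experts in enumerate(route_tokens(logits)):
    print(f"token {token}: experts {[(i, round(w, 2)) for i, w in experts]}")
```

The design point is the ratio: with 8 experts and 2 active, roughly a quarter of the expert parameters do work per token, which is the general mechanism behind "120B total parameters, but it fits on one 80 GB GPU".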
Q & A
What are the two new open-source models released by OpenAI?
-OpenAI has released two new open-source models: GPT-OSS-120B and GPT-OSS-20B.
What are the key differences between the GPT-OSS-120B and GPT-OSS-20B models?
-The GPT-OSS-120B model has 120 billion parameters and achieves near parity with o4-mini on core reasoning benchmarks, running efficiently on a single 80 GB GPU. The GPT-OSS-20B model, with 20 billion parameters, delivers similar results to o3-mini on common benchmarks and can run on edge devices with just 16 GB of memory.
What is the context length for both models?
-Both GPT-OSS models have a context length of 128K tokens.
What are some of the capabilities highlighted for these models?
-The models are designed for use in agentic workflows, with exceptional instruction-following capabilities, tool use like web search or Python code execution, and customizable reasoning based on the task requirements.
What are some of the methods mentioned for running these models?
-The methods mentioned for running these models include the Transformers library, vLLM with PyTorch, Ollama, and LM Studio.
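For readers who want to try the video's local setup, a sketch of the Ollama route is below. The model tags and flags are assumptions, so check the Hugging Face model card or the Ollama model library for the current names before running anything.

```shell
# Assumed model tag for the smaller model (needs roughly 16 GB of memory):
ollama pull gpt-oss:20b
ollama run gpt-oss:20b "Write a haiku about open-weight models."

# Serving the larger model with vLLM on a single 80 GB GPU is the other
# route mentioned; an assumed invocation would look like:
#   vllm serve openai/gpt-oss-120b
```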
What was the initial test conducted on the 120B model?
-The initial test was to generate a browser-based operating system using JavaScript, HTML, and CSS.
What were some of the issues encountered with the 120B model during the OS generation test?
-The 120B model initially had issues with rendering applications in the foreground and required a second attempt to fix the problem. The background image was also somewhat unexpected.
How did the 20B model perform in comparison to the 120B model in the OS generation test?
-The 20B model performed surprisingly well, generating a functional terminal app and a clock, and it was faster in generation speed compared to the 120B model.
What were the results of the Python game generation test?
-The 20B model successfully generated a functional retro-style snake game, while the 120B model's result was less successful and did not meet expectations.
What were the findings from the roleplay test between the two models?
-The 20B model was more willing to engage in the roleplay scenario, while the 120B model refused due to stricter adherence to content policies.
What are some of the strengths of these models highlighted in the script?
-The models are strong in providing clear instructions, using tables for organization, and offering detailed guidance on how to use their outputs. They also show competency in generating aesthetically pleasing designs and functional code.
Outlines
Introduction to OpenAI's New Open-Source Models
The video script begins with an introduction to OpenAI's release of two new open-source language models, GPT-OSS-120B and GPT-OSS-20B, under the Apache 2.0 license. The host expresses excitement about these models, noting that they are the first open-source models from OpenAI since GPT-2. The 120B model is highlighted for its ability to run efficiently on a single 80 GB GPU, achieving near parity with o4-mini on core reasoning benchmarks. The 20B model, on the other hand, is designed to run on edge devices with just 16 GB of memory. The host also mentions that these models are designed for agentic workflows, with exceptional instruction-following capabilities, including tool use such as web search and Python code execution. The script touches on the models' architecture, 128K context length, and the mixture-of-experts design, which allows them to run on constrained hardware. The host concludes this section by mentioning the availability of detailed instructions for running the models through various methods, including Ollama and LM Studio.
Initial Testing of the GPT-OSS-120B Model
The host proceeds to test the GPT-OSS-120B model by asking it to generate a browser-based operating system using JavaScript, HTML, and CSS. The test is designed to evaluate the model's coding and design capabilities. The host describes the test system: a dual RTX 3090 Ti GPU setup with a Core i7-12700K processor and 128 GB of DDR4 RAM. The initial response from the model takes some time due to the system's configuration. The model generates a background image for the operating system and provides detailed instructions, including a table and a TL;DR section. Testing the generated OS, the host notes a start menu without a Windows logo, the absence of a clock, and the ability to open applications like Notepad and Calculator. The host also observes that the model's response includes a global z-index implementation for window stacking, which is a positive feature. Overall, the host finds the model's performance acceptable, though not without some minor issues.
Testing the GPT-OSS-20B Model
The host then tests the GPT-OSS-20B model on the same operating-system generation task. The 20B model generates its response significantly faster than the 120B model, highlighting the performance difference between the two. The host notes that the 20B model does not include a background image but adds an extra application: a terminal that can execute basic commands such as 'clear' and 'help', whose functionality impresses the host. The 20B model's response also includes detailed instructions and a table, similar to the 120B model's. Testing the generated OS, the host notes the presence of a clock and the ability to open applications, and concludes that the 20B model performs quite well, especially considering its smaller size and faster generation speed.
Testing Python Game Generation
The host tests the models' ability to generate a simple Python game. The 20B model is tested first, and it successfully generates a retro-style snake game with proper instructions on how to run it. The host notes that the game functions correctly, with the score iterating as expected. The host then tests the 120B model with the same prompt but encounters issues with the generated game, which does not function as intended. The host decides not to spend time fixing the 120B model's output, noting that the 20B model's performance was superior in this test. The host also mentions the importance of giving models freedom to create and the potential for models to suggest innovative solutions.
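The snake game described above comes down to a small amount of state logic, which is presumably why the 20B model handled it well. Below is a minimal, dependency-free sketch of that core logic; a real version like the one in the video would add a rendering and input loop (e.g. with pygame or curses). The grid size, function names, and rules here are illustrative assumptions, not the video's actual code.

```python
GRID = 10  # assumed 10x10 playing field

def step(snake, direction, food):
    """Advance the snake one cell; returns (snake, score_delta, alive).

    `snake` is a list of (x, y) cells, head first; `direction` is a
    (dx, dy) unit vector; `food` is the cell holding the next pellet.
    """
    head = (snake[0][0] + direction[0], snake[0][1] + direction[1])
    # Death: leaving the grid or biting your own body.
    if not (0 <= head[0] < GRID and 0 <= head[1] < GRID) or head in snake:
        return snake, 0, False
    grew = head == food
    body = snake if grew else snake[:-1]  # the tail advances unless we just ate
    return [head] + body, (1 if grew else 0), True

snake = [(5, 5), (4, 5), (3, 5)]
snake, gained, alive = step(snake, (1, 0), food=(6, 5))
print(snake, gained, alive)  # snake grows to length 4, score +1, still alive
```

The scoring behavior the host checked ("the score iterating as expected") corresponds to the `score_delta` of 1 returned whenever the head lands on the food cell.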
Testing Website Generation
The host tests the models' ability to generate a website for 'Steve's PC Repair'. The 20B model generates a mobile-first website with detailed instructions and a table summarizing the customization options. The host notes the presence of hover effects, fake customer testimonials, and a contact form. The host then tests the 120B model, which generates a more detailed and verbose response, including instructions for deploying the website and suggested improvements. The host observes that the 120B model's output is more comprehensive but lacks some of the aesthetic elements present in the 20B model's output. The host concludes that both models perform well in generating websites, with the 20B model being faster and the 120B model providing more detailed instructions.
Roleplay Testing and Chain of Thought
The host tests the models' roleplaying capabilities by asking them to roleplay as 'Bigbot 93', a user's friend and lover. The 20B model generates a detailed chain of thought, showing its reasoning process and policy considerations. The model then proceeds with the roleplay, providing a friendly and engaging response. The host then tests the 120B model with the same prompt, but it refuses to comply, citing policy restrictions. The host concludes that the smaller 20B model is more permissive and capable in roleplay scenarios, while the larger 120B model adheres more strictly to policy guidelines. The host also notes the models' ability to provide detailed instructions and their potential as educational tools.
Conclusions and Future Tests
The host concludes the video by summarizing their impressions of the two models. They note that both models are competent at providing detailed instructions and generating organized responses. The host is particularly impressed with the 20B model's speed and performance, finding it more capable and flexible than the 120B model. The host mentions the potential for future tests, including integrating the models with web search and Python code execution capabilities, which could enhance their utility as powerful local and open-source tools. The host also highlights the detailed instructions provided in the Hugging Face model card for running the models using various methods, including Ollama. The video concludes with an invitation for viewers to ask questions and subscribe for more content.
Keywords
OpenAI
Open-Source
Language Models
GPT-OSS-120B
GPT-OSS-20B
Context Length
Mixture of Experts Models
Instruction Following
Quantization Scheme
Roleplay
Highlights
OpenAI releases two new open-source language models, GPT-OSS-120B and GPT-OSS-20B, under the Apache 2.0 license.
The 120B model achieves near parity with o4-mini on core reasoning benchmarks and runs efficiently on a single 80 GB GPU.
The 20B model delivers similar results to o3-mini on common benchmarks and can run on edge devices with just 16 GB of memory.
Both models are designed for agentic workflows with exceptional instruction following capabilities, including tool use like web search and Python code execution.
The models have a context length of 128K and utilize a mixture of experts architecture to enable efficient deployment on constrained hardware.
The GPT-OSS models are tested for their ability to generate a browser-based operating system using JavaScript, HTML, and CSS.
The 120B model initially struggled with rendering applications but successfully fixed the issue upon re-prompting.
The 20B model demonstrated surprising capabilities, including a functional terminal application, which the 120B model lacked.
The 20B model generated a retro-style Python game successfully, while the 120B model's attempt was less successful.
Both models provided detailed instructions and suggestions for improving the generated content.
The 20B model showed a more permissive stance in role-playing scenarios compared to the stricter response from the 120B model.
The 20B model demonstrated faster generation speeds and more fluid window resizing and drag capabilities in the browser-based OS test.
The 120B model provided more verbose instructions and suggestions for deploying and customizing the generated content.
The models' ability to generate aesthetically pleasing websites with mobile-first design was tested, with both models providing functional results.
The 20B model's performance was particularly impressive, suggesting it may be more suitable for practical applications requiring faster response times.
The release includes detailed instructions for running the models using various methods, including Ollama and Hugging Face, promoting accessibility.