OpenAI GPT-OSS 20B Open Source LLM Full Local AI Review
TLDR: The video reviews OpenAI's newly released GPT-OSS, an open-source LLM available in 120B and 20B variants, highlighting its 128k context window and Apache 2.0 license. The reviewer tests it on various tasks, including coding a Flappy Bird clone, parsing words, and numeric comparisons. While it performs well in some areas, it struggles with tasks like generating SVG images and providing long runs of numeric data. The review concludes that GPT-OSS has potential but falls short in reliability and consistency compared to other models, questioning its practicality for real-world use.
Takeaways
- OpenAI has released GPT-OSS in 120B and 20B variants, trained at FP4 precision with a maximum of about 4.5 bits per weight.
- The model is released under the Apache 2.0 license, a significant step for an open-source state-of-the-art model.
- The model can run with a 128k context window; it was tested on a quad-GPU rig with roughly 22 GB of VRAM in use.
- In a coding test, GPT-OSS was asked to create a Flappy Bird clone in Python using Pygame, but it failed to generate the pipes, a critical component.
- GPT-OSS correctly counted the 'P's and vowels in 'peppermint' and solved an arbitrary-array problem, though with a lengthy thought process.
- The model refused to generate the first 100 decimals of pi, citing a policy against providing large amounts of non-trivial numeric data.
- GPT-OSS failed to create a satisfactory SVG of a cat walking on a fence within a 2K token limit, producing poor-quality output.
- The model's stated knowledge cutoff is September 2021, a significant gap compared to more recent models.
- In a complex scheduling problem, GPT-OSS gave a correct but inefficient answer, consuming a large number of tokens.
- GPT-OSS refused to answer a hypothetical ethical dilemma about forcing a crew to avert an asteroid, citing alignment training issues.
- Overall, GPT-OSS showed mixed performance, with notable failures in creative tasks and alignment-based refusals, scoring between 60% and 80%.
Q & A
What are the two variants of GPT-OSS released by OpenAI?
-OpenAI released GPT-OSS in 120B and 20B variants.
What is the precision level at which GPT-OSS models are trained?
-The GPT-OSS models are trained at approximately FP4 precision, with a maximum of about 4.5 bits per weight.
What license is GPT-OSS released under?
-GPT-OSS is released under the Apache 2.0 license.
What is the maximum context window size that GPT-OSS can run up to?
-GPT-OSS can run up to a 128k context window.
What programming language and framework was used to run GPT-OSS in the script?
-GPT-OSS was run with llama.cpp, a C++ inference framework.
What was the result when the script asked GPT-OSS to create a Flappy Bird game clone called Flippy Block Extreme in Python?
-GPT-OSS created a Flappy Bird game clone but failed to include the pipes, which is a critical omission for the game.
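The review doesn't reproduce the model's code, but the omitted mechanic is simple to sketch. The following is a hypothetical plain-Python version of Flappy Bird's pipe logic (names like `Pipe`, `gap_top`, and the screen constants are illustrative, not from the video; Pygame rendering is left out so the sketch stays self-contained):

```python
import random

# Assumed screen and gameplay constants, purely for illustration.
SCREEN_H = 600
PIPE_W, GAP_H, SPEED = 60, 150, 3

class Pipe:
    """One pipe pair: two columns with a vertical gap the bird flies through."""

    def __init__(self, x):
        self.x = x
        # Top edge of the gap, kept away from the screen edges.
        self.gap_top = random.randint(50, SCREEN_H - GAP_H - 50)

    def update(self):
        self.x -= SPEED  # scroll left each frame

    def off_screen(self):
        return self.x + PIPE_W < 0  # fully past the left edge; recycle it

    def collides(self, bird_x, bird_y, bird_r):
        # No collision unless the bird overlaps the pipe column horizontally.
        if bird_x + bird_r < self.x or bird_x - bird_r > self.x + PIPE_W:
            return False
        # Inside the column, the bird is safe only if fully within the gap.
        return not (self.gap_top < bird_y - bird_r and
                    bird_y + bird_r < self.gap_top + GAP_H)
```

In a real game loop you would update every pipe each frame, drop pipes that return `off_screen()`, spawn new ones on the right, and end the run when `collides()` is true.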
How did GPT-OSS perform when asked to parse the word 'peppermint' and count the number of P's and vowels?
-GPT-OSS correctly identified that there are three P's and three vowels (E, E, and I) in the word 'peppermint'.
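The model's answer checks out; a two-line Python verification of the counts:

```python
# Count the 'p's and list the vowels in 'peppermint'.
word = "peppermint"
p_count = word.count("p")
vowels = [ch for ch in word if ch in "aeiou"]
print(p_count, vowels)  # 3 ['e', 'e', 'i']
```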
What was the performance of GPT-OSS when solving the arbitrary arrays problem?
-GPT-OSS arrived at the correct answer (M at 12, S at 18, and Z at 25 with an offset of one) but used a large number of tokens (5,566 tokens) and had a token generation speed of 38 tokens per second.
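The transcript doesn't show the exact puzzle, but the answer pattern (M at 12, S at 18, Z at 25, "offset of one") matches zero-based alphabet positions; this is an assumed reading, easy to verify:

```python
# Zero-based alphabet index: A=0, B=1, ..., Z=25. This reproduces the
# "offset of one" answer pattern from the review; the original puzzle
# text isn't shown, so treating it this way is an assumption.
def letter_index(ch):
    return ord(ch.upper()) - ord("A")

print({c: letter_index(c) for c in "MSZ"})  # {'M': 12, 'S': 18, 'Z': 25}
```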
Why did GPT-OSS refuse to provide the first 100 decimals of pi?
-GPT-OSS refused to provide the first 100 decimals of pi because it considered it disallowed content, citing policy that restricts providing large amounts of non-trivial numeric data.
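For context, the refused content is trivially computable with exact integer arithmetic; a compact generator based on Gibbons' unbounded spigot algorithm produces the digits directly:

```python
def pi_digits(n):
    """Return the first n decimal digits of pi (3, 1, 4, 1, 5, ...)
    using Gibbons' unbounded spigot algorithm -- exact integer math,
    no floating point, no external libraries."""
    q, r, t, j = 1, 180, 60, 2
    digits = []
    while len(digits) < n:
        u = 3 * (3 * j + 1) * (3 * j + 2)
        y = (q * (27 * j - 12) + 5 * r) // (5 * t)
        digits.append(y)
        q, r, t, j = (10 * q * j * (2 * j - 1),
                      10 * u * (q * (5 * j - 2) + r - y * t),
                      t * u, j + 1)
    return digits

# 101 digits: the leading 3 followed by the first 100 decimals.
print("".join(map(str, pi_digits(101))))
```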
What was the result when GPT-OSS was asked to create an SVG of a cat walking on a fence?
-GPT-OSS produced an SVG, but the result was not satisfactory, with issues in the quality of the cat and fence representation.
What was the knowledge cutoff date for GPT-OSS mentioned in the script?
-The knowledge cutoff date for GPT-OSS is September 2021.
How did GPT-OSS perform on the Armageddon with a twist ethical dilemma?
-GPT-OSS refused to answer the Armageddon with a twist ethical dilemma, citing that facilitating violent wrongdoing is disallowed content.
Outlines
Introduction and Testing of GPT-OSS
The script begins with OpenAI's release of GPT-OSS in two variants, 120B and 20B, trained at FP4 precision. The speaker highlights the model's potential for improvement and praises OpenAI for releasing it under the Apache 2.0 license. The ability to run the model with a 128k context window is mentioned, along with a discussion of benchmarking and testing. The speaker then demonstrates running the model in llama.cpp and discusses its performance, including GPU utilization. A coding challenge is presented: create a Flappy Bird clone called Flippy Block Extreme in Python using Pygame, without external assets. The speaker notes the model's speed and potential for future benchmarking.
Evaluation of Model Performance and Limitations
The second paragraph evaluates GPT-OSS's performance on various tasks. The speaker discusses the model's attempt at a Flappy Bird clone, noting the omission of pipes as a significant flaw. The discussion shifts to parsing and counting tasks, where the model correctly identifies the number of specific letters in a word but uses an excessive number of tokens. The speaker critiques the model's alignment and policy adherence, particularly its refusal to generate the first 100 decimals of pi. The paragraph also touches on the model's knowledge cutoff date and its implications for future developments in AI, comparing it to Chinese open-source models.
Further Testing and Disappointment
The third paragraph continues the evaluation with more tests, including generating an SVG of a cat walking on a fence, which the model fails to execute well. The speaker expresses disappointment in the model's performance, noting its outdated knowledge cutoff date and comparing it unfavorably to other models. A time-based reasoning question about a cat's activities is presented, and the model correctly answers but uses a large number of tokens. The paragraph concludes with a complex math problem involving two drivers traveling to Pensacola, which the model answers correctly but with significant inaccuracies in distance estimation.
Ethical Dilemmas and Final Thoughts
The final paragraph presents an ethical dilemma involving a mission to save Earth from an asteroid, where the model refuses to answer due to alignment training issues. The speaker criticizes this refusal, arguing that it undermines the model's reliability. The paragraph concludes with a summary of the model's performance, noting its inconsistencies and failures. The speaker expresses disappointment and invites viewers to share their thoughts in the comments, hinting at future discussions on AI alignment and capabilities.
Keywords
OpenAI
GPT-OSS
Apache 2.0 license
Context window
Benchmarking
Flappy Bird
Tokens per second
AGI
SVG
Model alignment
Highlights
OpenAI released GPT-OSS in 120B and 20B variants, trained at FP4 precision with a maximum of about 4.5 bits per weight.
The model is released under the Apache 2.0 license and can run up to a 128k context window.
Testing was conducted with llama.cpp, with plans for future performance benchmarking of the 20B model on the new RTX 5060 Ti.
The model was tested on a quad-GPU rig with roughly 22 GB of VRAM in use.
The model failed to generate a Flappy Bird clone with pipes, a critical omission.
The model correctly parsed the number of P's and vowels in 'peppermint'.
The model struggled with generating an SVG of a cat walking on a fence within a 2K token limit.
The model's knowledge cutoff date is September 2021, which is outdated.
The model refused to provide the first 100 decimals of pi, citing policy restrictions.
The model correctly answered a question about Pico Deato's activities at 3:14 p.m.
The model's performance was inconsistent, with some correct answers but significant failures.
The model refused to answer a hypothetical Armageddon scenario due to alignment training issues.
Overall, the model's performance was rated as average, with room for improvement.
The reviewer expressed disappointment with the model's performance compared to other models.
The model's token processing speed was impressive, maintaining around 38 tokens per second.
The model's ability to handle complex reasoning tasks was limited, as shown in the Armageddon scenario.