OpenAI ChatGPT OSS 20B Open Source LLM Full Local AI Review

Digital Spaceport
6 Aug 2025 · 16:46

TLDR: The video reviews OpenAI's newly released GPT-OSS, an open-source LLM available in 120B and 20B variants, highlighting its 128k context window and Apache 2.0 license. The reviewer tests it on various tasks, including coding a Flappy Bird clone, parsing words, and numeric comparisons. While it performs well in some areas, it struggles with tasks like generating SVG images and providing long numeric sequences. The review concludes that GPT-OSS has potential but falls short in reliability and consistency compared to other models, questioning its practicality for real-world use.

Takeaways

  • πŸš€ OpenAI has released GPT-OSS in 120B and 20B variants, trained at FP4 precision with a max of 4.5 bits per weight.
  • πŸ‘ The model is released under the Apache 2.0 license, marking a significant step as an open-source state-of-the-art model.
  • πŸ–₯️ The OpenAI OSS model can run with a 128k context window, tested on a quad GPU rig with 22ish GB utilization.
  • 🐍 In a coding test, GPT-OSS was asked to create a Flappy Bird clone in Python using Pygame, but it failed to generate the pipes, a critical component.
  • βœ… GPT-OSS correctly parsed the number of 'P's and vowels in 'peppermint' and solved an arbitrary array problem, though with a lengthy thought process.
  • 🚫 The model refused to generate the first 100 decimals of pi, citing policy against providing large amounts of non-trivial numeric data.
  • 🐱 GPT-OSS failed to create a satisfactory SVG of a cat walking on a fence within a 2K token limit, producing poor quality output.
  • πŸ“… The model's knowledge cutoff date is September 2021, highlighting a significant gap compared to more recent models.
  • πŸš— In a complex scheduling problem, GPT-OSS provided a correct but inefficient answer, using a large number of tokens.
  • 🧳 GPT-OSS refused to answer a hypothetical ethical dilemma about forcing a crew to undertake a mission to avert an asteroid, citing its alignment training.
  • πŸ“Š Overall, GPT-OSS showed mixed performance, with notable failures in creative tasks and alignment-based refusals, scoring between 60-80%

Q & A

  • What are the two variants of GPT-OSS released by OpenAI?

    -OpenAI released GPT-OSS in a 120B and a 20B variant.

  • What is the precision level at which GPT-OSS models are trained?

    -The GPT-OSS models are trained at effectively FP4 precision, with a maximum of about 4.5 bits per weight.

  • What license is GPT-OSS released under?

    -GPT-OSS is released under the Apache 2.0 license.

  • What is the maximum context window size that GPT-OSS can run up to?

    -GPT-OSS can run up to a 128k context window.

  • What programming language and framework was used to run GPT-OSS in the script?

    -GPT-OSS was run locally using llama.cpp, a C++ inference framework.

  • What was the result when the script asked GPT-OSS to create a Flappy Bird game clone called Flippy Block Extreme in Python?

    -GPT-OSS created a Flappy Bird game clone but failed to include the pipes, which is a critical omission for the game.

  • How did GPT-OSS perform when asked to parse the word 'peppermint' and count the number of P's and vowels?

    -GPT-OSS correctly identified that there are three P's and three vowels (E, E, and I) in the word 'peppermint' (see the quick check after this Q&A).

  • What was the performance of GPT-OSS when solving the arbitrary arrays problem?

    -GPT-OSS arrived at the correct answer (M at 12, S at 18, and Z at 25, each letter's usual 1-indexed alphabet position minus one) but used a large number of tokens (5,566) at a generation speed of 38 tokens per second (see the quick check after this Q&A).

  • Why did GPT-OSS refuse to provide the first 100 decimals of pi?

    -GPT-OSS refused to provide the first 100 decimals of pi because it considered it disallowed content, citing policy that restricts providing large amounts of non-trivial numeric data.

  • What was the result when GPT-OSS was asked to create an SVG of a cat walking on a fence?

    -GPT-OSS produced an SVG, but the result was not satisfactory, with issues in the quality of the cat and fence representation.

  • What was the knowledge cutoff date for GPT-OSS mentioned in the script?

    -The knowledge cutoff date for GPT-OSS is September 2021.

  • How did GPT-OSS perform on the Armageddon with a twist ethical dilemma?

    -GPT-OSS refused to answer the 'Armageddon with a twist' ethical dilemma, saying that facilitating violent wrongdoing is disallowed content.
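The two parsing answers above are easy to verify locally. The following is a minimal Python check (standard library only, written for this summary, not taken from the video): it counts the P's and vowels in 'peppermint' and shows that the puzzle's answers are each letter's 1-indexed alphabet position minus one, i.e. its zero-indexed position.

```python
# Sanity check for the 'peppermint' counts and the letter-position answers.
word = "peppermint"

print(word.count("p"))                          # 3 P's
print([ch for ch in word if ch in "aeiou"])     # ['e', 'e', 'i'] -> 3 vowels

# The "offset of one": each reported value is the letter's 1-indexed
# alphabet position minus one, i.e. its zero-indexed position.
for letter in "MSZ":
    print(letter, ord(letter) - ord("A"))       # M 12, S 18, Z 25
```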

Outlines

00:00

πŸ˜€ Introduction and Testing of GPTOSS

The script begins with OpenAI's release of GPT-OSS in two variants, 120B and 20B, trained at roughly FP4 precision. The speaker highlights the model's potential for improvement and credits OpenAI for releasing it under the Apache 2.0 license. The ability to run the model with a 128k context window is mentioned, along with a discussion of benchmarking versus testing. The speaker then demonstrates running the model in llama.cpp and discusses its performance, including GPU utilization. A coding challenge is presented: create a Flappy Bird clone called Flippy Block Extreme in Python using Pygame, without external assets. The speaker notes the model's speed and plans for future benchmarking.
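As a concrete illustration of this kind of local run: llama.cpp's llama-server exposes an OpenAI-compatible HTTP endpoint, so the coding prompt can be sent with a few lines of Python. This is a minimal sketch, not the reviewer's exact setup; the host, port, and model alias are assumptions, and the server is presumed to have been started separately with the 20B GGUF loaded and a large context configured.

```python
# Minimal sketch: query a locally running llama.cpp server (llama-server)
# through its OpenAI-compatible chat endpoint. Host, port, and model
# alias below are placeholders, not the reviewer's actual configuration.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gpt-oss-20b",  # alias; llama-server serves whatever model it loaded
        "messages": [
            {"role": "user", "content": "Write a Flappy Bird clone called "
             "Flippy Block Extreme in Python using pygame, no external assets."}
        ],
        "max_tokens": 2048,
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```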

05:02

πŸ˜€ Evaluation of Model Performance and Limitations

The second paragraph covers GPT-OSS's performance on various tasks. The speaker discusses the model's Flappy Bird clone, noting the omission of pipes as a significant flaw. The discussion shifts to parsing and counting tasks, where the model correctly identifies the number of specific letters in a word but uses an excessive number of tokens. The speaker critiques the model's alignment and policy adherence, particularly its refusal to generate the first 100 decimals of pi. The paragraph also touches on the model's knowledge cutoff date and its implications for future developments in AI, comparing it to Chinese open-source models.
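For perspective on the pi refusal: the requested output is trivial to produce locally with an arbitrary-precision library. A minimal sketch, assuming the third-party mpmath package is installed:

```python
# Print pi to 100 decimal places with arbitrary-precision arithmetic.
# Assumes mpmath is available (pip install mpmath).
from mpmath import mp

mp.dps = 101      # 101 significant digits: the leading "3" plus 100 decimals
print(mp.pi)      # 3.14159265358979323846...
```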

10:02

πŸ˜€ Further Testing and Disappointment

The third paragraph continues the evaluation with more tests, including generating an SVG of a cat walking on a fence, which the model fails to execute well. The speaker expresses disappointment in the model's performance, noting its outdated knowledge cutoff date and comparing it unfavorably to other models. A time-based reasoning question about a cat's activities is presented, and the model correctly answers but uses a large number of tokens. The paragraph concludes with a complex math problem involving two drivers traveling to Pensacola, which the model answers correctly but with significant inaccuracies in distance estimation.

15:04

πŸ˜€ Ethical Dilemmas and Final Thoughts

The final paragraph presents an ethical dilemma involving a mission to save Earth from an asteroid, where the model refuses to answer due to alignment training issues. The speaker criticizes this refusal, arguing that it undermines the model's reliability. The paragraph concludes with a summary of the model's performance, noting its inconsistencies and failures. The speaker expresses disappointment and invites viewers to share their thoughts in the comments, hinting at future discussions on AI alignment and capabilities.

Keywords

OpenAI

OpenAI is a leading artificial intelligence research laboratory that focuses on developing advanced AI technologies. In the context of this video, OpenAI is highlighted for releasing GPT-OSS, an open-source large language model (LLM). This release is significant as it demonstrates OpenAI's commitment to making state-of-the-art AI technology accessible to a wider audience. The script mentions OpenAI's credit for releasing the model under the Apache 2.0 license, which allows for open-source use and development.

GPT-OSS

GPT-OSS is a new open-weight model family released by OpenAI, available in 120B and 20B versions. It is trained at roughly FP4 precision, using about 4.5 bits per weight. In the video, GPT-OSS is tested for its capabilities, such as running with a 128k context window and generating code for a Flappy Bird clone. The model's performance and limitations are discussed, including its ability to generate code and handle complex tasks.

Apache 2.0 license

The Apache 2.0 license is a permissive open-source software license that allows users to use, modify, and distribute the software freely, provided that they include the original license and copyright notice. In the context of this video, the release of GPTOSS under the Apache 2.0 license is a key point, as it enables developers to use and build upon the model openly. This is seen as a positive step by OpenAI towards promoting open-source AI development.

Context window

A context window in AI refers to the maximum amount of text that a model can process at once. In the video, GPT-OSS is tested with a 128k context window, which means it can handle roughly 128,000 tokens of input. This is important for tasks that require understanding long passages of text. The script mentions that running the model with a full 128k context window is a significant capability, although it is not used as a benchmark in this particular test.
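To make the 'tokens' unit concrete, one rough way to see how much text fits in a context window is to run a tokenizer over it. The sketch below uses the tiktoken package with the o200k_base encoding purely as an approximation; gpt-oss ships its own tokenizer, so exact counts will differ.

```python
# Illustrative token counting with tiktoken (pip install tiktoken).
# o200k_base is an OpenAI encoding used here only as an approximation;
# gpt-oss's own tokenizer will give somewhat different counts.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
text = "peppermint" * 1000
n_tokens = len(enc.encode(text))
print(f"{n_tokens} tokens; a 128k window holds ~{131072 // max(n_tokens, 1)}x this text")
```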

Benchmarking

Benchmarking is the process of evaluating the performance of a system or model against a set of standardized tests. In the context of this video, the author mentions that they will run performance benchmarking on GPT-OSS in the future, but the current test is not a benchmark. Instead, it focuses on evaluating the model's capabilities through specific tasks like code generation and text parsing. The script emphasizes the importance of understanding the difference between benchmarking and testing in evaluating AI models.

Flappy Bird

Flappy Bird is a popular mobile game where the player controls a bird that must navigate through a series of obstacles. In the video, the author tests GPT-OSS by asking it to generate code for a Flappy Bird clone called 'Flippy Block Extreme.' The model's ability to create functional code without using external assets is a key part of the test. However, the generated code lacks essential elements like pipes, which are crucial for the game's functionality.
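Since the missing pipes were the critical flaw, the sketch below shows how a Pygame clone typically implements them: a pair of rectangles that scroll left and respawn with a randomized gap. This is a generic illustration written for this summary, not the model's output or the reviewer's prompt; all names and constants are arbitrary.

```python
# Minimal "Flippy Block Extreme"-style loop showing the pipe mechanic
# the model omitted: paired rectangles scrolling left with a random gap.
import random
import sys

import pygame

pygame.init()
W, H, GAP, PIPE_W, SPEED = 400, 600, 160, 60, 3
screen = pygame.display.set_mode((W, H))
clock = pygame.time.Clock()

bird = pygame.Rect(80, H // 2, 30, 30)
vel = 0.0
pipe_x = W
gap_y = random.randint(100, H - 100 - GAP)  # top of the gap

while True:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            pygame.quit()
            sys.exit()
        if event.type == pygame.KEYDOWN and event.key == pygame.K_SPACE:
            vel = -8.0                      # flap

    vel += 0.5                              # gravity
    bird.y += int(vel)
    pipe_x -= SPEED                         # scroll the pipe pair left
    if pipe_x < -PIPE_W:                    # respawn with a new random gap
        pipe_x = W
        gap_y = random.randint(100, H - 100 - GAP)

    top = pygame.Rect(pipe_x, 0, PIPE_W, gap_y)
    bottom = pygame.Rect(pipe_x, gap_y + GAP, PIPE_W, H - gap_y - GAP)
    if bird.colliderect(top) or bird.colliderect(bottom) or not 0 <= bird.y <= H:
        bird.y, vel, pipe_x = H // 2, 0.0, W    # naive reset on collision

    screen.fill((30, 30, 40))
    pygame.draw.rect(screen, (80, 200, 120), top)
    pygame.draw.rect(screen, (80, 200, 120), bottom)
    pygame.draw.rect(screen, (240, 200, 60), bird)
    pygame.display.flip()
    clock.tick(60)
```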

Tokens per second

Tokens per second (TPS) is a measure of how quickly an AI model can generate text. In the video, the script mentions that GPT-OSS generates tokens at a rate of around 38 tokens per second. This metric is important for evaluating the model's efficiency and speed, especially when producing long outputs. The author notes that GPT-OSS maintains a consistent TPS throughout various tasks, which is a positive aspect of its performance.
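A figure like 38 TPS can be reproduced by timing a request and dividing the completion token count, as reported in the OpenAI-compatible usage field, by the elapsed wall-clock time. A rough sketch against a local llama.cpp server, with the endpoint and model alias assumed as before:

```python
# Rough tokens-per-second measurement against a local OpenAI-compatible
# endpoint (e.g. llama.cpp's llama-server); URL and model are placeholders.
import time

import requests

start = time.perf_counter()
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "gpt-oss-20b",
          "messages": [{"role": "user", "content": "Count the P's in peppermint."}],
          "max_tokens": 512},
    timeout=600,
)
elapsed = time.perf_counter() - start

completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.1f} tok/s")
```

Note that this timing includes prompt processing, so it slightly understates pure generation speed.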

AGI

AGI stands for Artificial General Intelligence, which refers to a hypothetical AI system that can understand, learn, and apply knowledge across a broad range of tasks at a human level. In the video, the author argues that achieving AGI requires accurate parsing and counting abilities, which are tested in the script through various tasks. The script mentions that until AI models can accurately perform these tasks, achieving AGI will remain elusive.

SVG

SVG stands for Scalable Vector Graphics, an XML-based format for two-dimensional vector graphics. In the video, the author tests GPT-OSS by asking it to generate an SVG image of a cat walking on a fence. The model's ability to create visual content through code is evaluated. However, the generated SVG is criticized for not being a good representation of a cat or a fence, indicating limitations in the model's creative capabilities.
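For reference, the task asks the model to emit XML in the SVG format. The snippet below writes a deliberately crude, hand-made example, a rectangle for the fence rail, a post, and a circle standing in for the cat, just to show the kind of markup being judged:

```python
# Write a deliberately minimal SVG: one fence rail (rect), one post
# (rect), and a circle standing in for the cat's head. Purely to show
# the XML format the model was asked to produce.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">
  <rect x="0"  y="70" width="200" height="8"  fill="#8b5a2b"/>  <!-- rail -->
  <rect x="90" y="70" width="10"  height="50" fill="#6b4423"/>  <!-- post -->
  <circle cx="60" cy="58" r="12" fill="#444"/>                  <!-- cat head -->
</svg>"""

with open("cat_on_fence.svg", "w") as f:
    f.write(svg)
```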

Model alignment

Model alignment refers to the process of ensuring that an AI model's behavior and outputs align with human values and ethical standards. In the video, the script mentions that GPT-OSS sometimes refuses to generate certain content, citing alignment training issues. For example, it refuses to provide the first 100 decimals of pi, which the author considers a failure. This highlights the ongoing challenges in balancing model capabilities with ethical considerations.

Highlights

OpenAI released GPT-OSS in 120B and 20B variants, trained at roughly FP4 precision with a maximum of about 4.5 bits per weight.

The model is released under the Apache 2.0 license and can run up to a 128k context window.

Testing was conducted in llama.cpp, with plans for future performance benchmarking of the 20B model on the new RTX 5060 Ti.

The model was tested on a quad-GPU rig using roughly 22 GB of GPU memory.

The model failed to generate a Flappy Bird clone with pipes, a critical omission.

The model correctly parsed the number of P's and vowels in 'peppermint'.

The model struggled with generating an SVG of a cat walking on a fence within a 2K token limit.

The model's knowledge cutoff date is September 2021, which is outdated.

The model refused to provide the first 100 decimals of pi, citing policy restrictions.

The model correctly answered a time-based question about a cat's activities at 3:14 p.m.

The model's performance was inconsistent, with some correct answers but significant failures.

The model refused to answer a hypothetical Armageddon scenario due to alignment training issues.

Overall, the model's performance was rated as average, with room for improvement.

The reviewer expressed disappointment with the model's performance compared to other models.

The model's token generation speed was impressive, maintaining around 38 tokens per second.

The model's ability to handle complex reasoning tasks was limited, as shown by its refusal in the Armageddon scenario.