OpenAI's new open source LLMs spark mixed community reactions

VentureBeat

OpenAI, a company whose very name implies openness, recently made a significant pivot by releasing two new large language models (LLMs), gpt-oss-120B and gpt-oss-20B, under the permissive Apache 2.0 open-source license. This move marks the first time since 2019 that OpenAI has made a cutting-edge language model publicly available for unrestricted use, signaling a notable departure from the proprietary, closed-source approach that has defined the nearly three years of the ChatGPT era. During this period, users typically paid for access to OpenAI's models, with limited customization and no ability to run them offline or on private hardware.

The new gpt-oss models aim to democratize access to powerful AI. The larger gpt-oss-120B is designed for deployment on a single Nvidia H100 GPU, suitable for small to medium-sized enterprise data centers, while its smaller counterpart, gpt-oss-20B, is light enough to run on a consumer laptop. However, despite achieving impressive technical benchmarks that align with OpenAI’s own powerful proprietary offerings, the broader AI developer and user community has responded with a remarkably diverse range of opinions, akin to a movie premiere receiving a near 50/50 split on a review aggregator.
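A back-of-envelope memory estimate illustrates why a roughly 120-billion-parameter model can plausibly fit on a single 80 GB H100. The sketch below assumes approximately 4-bit weight quantization, an assumption for illustration rather than a detail confirmed in this article; actual memory use also depends on activations and the key-value cache, which are ignored here.

```python
# Rough check: do ~120B parameters fit in one H100's memory?
# Assumes ~4-bit quantized weights (an illustrative assumption).
PARAMS = 120e9          # approximate parameter count
BITS_PER_PARAM = 4      # assumed quantization level
H100_MEMORY_GB = 80     # H100 on-board memory capacity

# Convert parameter count to gigabytes of weight storage.
weight_gb = PARAMS * BITS_PER_PARAM / 8 / 1e9

print(f"weights ~= {weight_gb:.0f} GB; "
      f"fits on one H100: {weight_gb < H100_MEMORY_GB}")
```

At 4 bits per parameter the weights alone come to about 60 GB, leaving headroom on an 80 GB card; at 16-bit precision the same model would need roughly 240 GB, which is why quantization matters for single-GPU deployment.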

Initial independent testing has yielded feedback ranging from genuine enthusiasm to pointed dissatisfaction. Much of the criticism stems from direct comparisons to the growing wave of powerful, multimodal LLMs emerging from Chinese startups, many of which are also Apache 2.0-licensed and can be freely adapted and run locally anywhere in the world.

While intelligence benchmarks from independent firm Artificial Analysis position gpt-oss-120B as “the most intelligent American open weights model,” it still falls short when measured against Chinese heavyweights like DeepSeek R1 and Qwen3 235B. This disparity has fueled skepticism. A self-proclaimed DeepSeek enthusiast, @teortaxesTex, remarked that the models appear to have merely “mogged on benchmarks,” predicting a lack of good derivative models or new use cases. Pseudonymous open-source AI researcher Teknium, co-founder of Nous Research, echoed this, labeling the release a “legitimate nothing burger” and expressing deep disappointment, anticipating a swift eclipse by a Chinese competitor.

Further criticism has centered on the gpt-oss models’ perceived narrow utility. AI influencer “Lisan al Gaib” observed that while the models excel in math and coding, they “completely lack taste and common sense,” questioning their broader applicability. This “bench-maxxing” approach, optimizing heavily for specific benchmarks, reportedly leads to unusual outputs; Teknium shared a screenshot showing the model injecting an integral formula mid-poem during creative writing tests. Researchers like @kalomaze from Prime Intellect and former Googler Kyle Corbitt speculated that the gpt-oss models were likely trained predominantly on synthetic data—AI-generated data used specifically for training new models. This approach, possibly adopted to circumvent copyright issues or avoid safety problems associated with real-world data, results in models that are “extremely spiky,” performing exceptionally well on trained tasks like coding and math, but poorly on more linguistic tasks such as creative writing or report generation.

Concerns also emerged from third-party benchmark evaluations. SpeechMap, which measures how willingly LLMs comply with user prompts on sensitive or contested topics, showed gpt-oss-120B scoring under 40%, near the bottom of its peers, indicating a strong tendency to default to internal guardrails. In Aider's Polyglot evaluation, a coding benchmark spanning multiple programming languages, gpt-oss-120B achieved only 41.8%, significantly trailing competitors. Some users also reported an unusual resistance to generating criticism of China or Russia, contrasting with its treatment of the US and EU, raising questions about potential biases in its training data.

Despite these criticisms, not all reactions have been negative. Software engineer Simon Willison praised the release as “really impressive,” highlighting the models’ efficiency and their ability to achieve parity with OpenAI’s proprietary o3-mini and o4-mini models. He commended their strong performance on reasoning and STEM-heavy benchmarks, along with the innovative “Harmony” prompt template and support for third-party tool use. Clem Delangue, CEO of Hugging Face, urged patience, suggesting that early issues might stem from infrastructure instability and insufficient optimization. He emphasized that “the power of open-source is that there’s no cheating,” assuring that the models’ true strengths and limitations would progressively be uncovered.

Ethan Mollick, a professor at the Wharton School, acknowledged that the US now likely possesses leading open-weights models, but questioned OpenAI's long-term commitment, noting that this lead could "evaporate quickly" if the company lacks incentives to keep the models updated. Nathan Lambert, a prominent AI researcher at the Allen Institute for AI (Ai2), hailed the release's symbolic importance for the open ecosystem, particularly for Western nations, recognizing the significant step of the most recognized AI brand returning to open releases. However, he cautioned that gpt-oss is "unlikely to meaningfully slow down" Chinese competitors like Qwen, due to their existing usability and variety. Lambert concluded that while the release marks a crucial shift in the US toward open models, OpenAI still has "a long path back" to truly catch up in practice.

Ultimately, the verdict on OpenAI’s gpt-oss models remains split. While they represent a landmark achievement in terms of licensing and accessibility, offering a powerful, free alternative to proprietary systems, their real-world “vibes,” as many users describe them, are proving less compelling than their benchmark scores suggest. The true measure of their success will depend on whether developers can build robust applications and innovative derivatives on top of them, determining if this release is a genuine breakthrough or merely a brief blip in the evolving landscape of artificial intelligence.