OpenAI GPT-OSS: New Open-Weight Models Benchmarked Against Rivals
OpenAI has unveiled GPT-OSS-120b and GPT-OSS-20b, a new series of open-weight reasoning models released under the permissive Apache 2.0 license. These text-only models are engineered for robust instruction following, sophisticated tool use, and strong reasoning capabilities, positioning them as prime candidates for integration into advanced agentic workflows. This release underscores OpenAI’s continued dedication to fostering innovation and collaborative safety within the broader AI community.
A critical question for developers and researchers is how these new models stack up against other leading contenders in the rapidly evolving open- and semi-open-weight ecosystem. To provide clarity, a detailed comparison of GPT-OSS against models like GLM-4.5, Qwen3-Thinking, DeepSeek-R1, and Kimi K2 offers valuable insights into their respective strengths and trade-offs.
The GPT-OSS models build on the GPT-2 and GPT-3 architectures, notably incorporating a Mixture-of-Experts (MoE) design. This choice is pivotal for efficiency during both training and inference: only a subset of parameters is activated per token, so the models reach the scale of very large systems while keeping compute costs in check. The family comprises two models: GPT-OSS-120b, with 116.8 billion total parameters and roughly 5.1 billion active per token across 36 layers, and GPT-OSS-20b, with 20.9 billion total parameters and 3.6 billion active per token across 24 layers. Both share several architectural elements, including a residual stream dimension of 2880, Grouped Query Attention with 64 query heads and 8 key-value heads, and rotary position embeddings. They also support an extended context length of 131,072 tokens via YaRN.
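To make the activation pattern concrete, here is a minimal sketch of top-k expert routing in PyTorch. The expert count, hidden size, and top-k value are illustrative, not the released GPT-OSS configuration; only the routing idea carries over: a router scores all experts, but only the k highest-scoring expert MLPs actually run for each token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer (illustrative sizes only)."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = self.router(x)                     # (tokens, n_experts)
        top_w, top_i = scores.topk(self.k, dim=-1)  # keep only the k best experts per token
        top_w = F.softmax(top_w, dim=-1)            # normalize weights over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = top_i[:, slot] == e          # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * self.experts[e](x[mask])
        return out

# Toy configuration -- not the released GPT-OSS layer sizes.
moe = TopKMoE(d_model=2880, d_hidden=2880 * 4, n_experts=8, k=2)
tokens = torch.randn(16, 2880)
print(moe(tokens).shape)  # torch.Size([16, 2880]); only 2 of 8 expert MLPs ran per token
```

Because the unused experts never execute, the per-token compute tracks the active parameter count rather than the total, which is why a 116.8B-parameter model can behave, cost-wise, more like a ~5B-parameter one.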
To make deployment practical, OpenAI applies MXFP4 quantization to the MoE weights. This lets the 120-billion-parameter model run on a single 80GB GPU, while its 20-billion-parameter sibling fits in as little as 16GB of memory, significantly broadening accessibility. Another notable feature is variable reasoning effort: developers can specify "low," "medium," or "high" reasoning in the system prompt, which adjusts the length of the Chain-of-Thought (CoT) and offers a direct lever for trading off accuracy, latency, and compute cost. The models are also trained with built-in support for agentic workflows, including a browsing tool for real-time web search, a Python tool for stateful code execution in a Jupyter-like environment, and custom developer functions, enabling complex, interleaved reasoning and user interaction.
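As a quick illustration, the sketch below loads the smaller checkpoint with Hugging Face transformers and requests high reasoning effort. It assumes the "openai/gpt-oss-20b" checkpoint name and the "Reasoning: high" system-prompt convention described in the release materials; exact template handling may vary across library versions.

```python
from transformers import pipeline

# Load the 20b checkpoint; device_map="auto" places weights on available hardware.
generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    # The reasoning level ("low" / "medium" / "high") is declared in the system
    # prompt; higher levels produce longer CoT in exchange for better accuracy.
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Prove that the sum of two even integers is even."},
]

result = generator(messages, max_new_tokens=512)
print(result[0]["generated_text"][-1]["content"])  # the assistant's final reply
```

In practice, "low" is a reasonable default for latency-sensitive chat, while "high" is worth the extra tokens on math, code, and multi-step planning tasks.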
The open-model ecosystem is rich with formidable contenders, each possessing distinct strengths. Comparing GPT-OSS across various benchmarks — reasoning, coding, and agentic workflows — provides a clearer understanding of its standing.
In broad knowledge and reasoning tasks, GPT-OSS posts some of the highest scores relative to its size. On MMLU-Pro, GPT-OSS-120b achieves 90.0%, surpassing GLM-4.5 (84.6%), Qwen3-Thinking (84.4%), DeepSeek-R1 (85.0%), and Kimi K2 (81.1%). On competition-style math, it reaches 96.6% on AIME 2024 and 97.9% on AIME 2025 with tool assistance, outperforming all other compared models. On GPQA, a PhD-level science benchmark, GPT-OSS-120b scores 80.9% with tools, comparable to GLM-4.5 (79.1%) and Qwen3-Thinking (81.1%) and just shy of DeepSeek-R1 (81.0%). These figures are all the more significant given GPT-OSS-120b's MoE design, in which only 5.1 billion parameters are active per token. GLM-4.5 and Qwen3-Thinking, by contrast, are considerably larger models that activate far more parameters per token, which partly explains their strong tool-use and coding results; DeepSeek-R1 likewise operates at a much larger scale and tends to spend more tokens on reasoning, while Kimi K2 is an instruction-tuned model that answers without long reasoning traces. The practical consequence is that GPT-OSS delivers frontier-level reasoning with a much lighter active-parameter footprint, making it a cost-effective choice for deep reasoning tasks.
In coding and software engineering, modern benchmarks assess a model's capacity to understand large codebases, implement changes, and carry out multi-step reasoning. On SWE-bench Verified, GPT-OSS-120b scores 62.4%, closely trailing GLM-4.5 (64.2%) and DeepSeek-R1 (approximately 65.8% in agentic mode). On Terminal-Bench, GLM-4.5 leads with 37.5%, followed by Kimi K2 at around 30%. GLM-4.5 also performs strongly in head-to-head agentic coding, achieving over 50% win rates against Kimi K2 and over 80% against Qwen3, while maintaining a high success rate in tool-based coding workflows. Model scale plays a role here: GLM-4.5 activates far more parameters per token than GPT-OSS-120b, which helps in agentic coding. For developers seeking solid code-editing capability in a model that runs on a single 80GB GPU, however, GPT-OSS offers a compelling balance.
Agentic capabilities, where a model autonomously calls tools, executes functions, and solves multi-step tasks, are becoming increasingly vital. On TAU-bench Retail, GPT-OSS-120b scores 67.8%, compared with GLM-4.5's 79.7% and Kimi K2's 70.6%. On BFCL-v3, a function-calling benchmark, GLM-4.5 leads with 77.8%, followed by Qwen3-Thinking at 71.9%, with GPT-OSS at around 67–68%. These results highlight a common trade-off: GLM-4.5 excels at function calling and agentic workflows, but as a significantly larger, more resource-intensive model. GPT-OSS, by contrast, delivers competitive results while remaining deployable for developers without multi-GPU clusters.
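To ground what function calling looks like in practice, here is a minimal sketch of a tool-call round trip, assuming GPT-OSS is served behind an OpenAI-compatible endpoint (for example via vLLM or Ollama). The server URL, model name, and the get_order_status tool are hypothetical stand-ins, chosen in the spirit of TAU-bench Retail; they are not part of any official API.

```python
import json
from openai import OpenAI

# Hypothetical local deployment exposing the OpenAI-compatible chat API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",  # illustrative retail tool, not a real API
        "description": "Look up the shipping status of a customer order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Where is order 8472?"}],
    tools=tools,
)

# If the model decides a tool is needed, it returns a structured call
# instead of prose; the caller executes it and feeds the result back.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, json.loads(call.function.arguments))
```

Benchmarks like TAU-bench and BFCL-v3 essentially measure how reliably a model produces well-formed calls like this, and how sensibly it chains them across turns.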
In summary, the open-weight landscape presents diverse strengths. GPT-OSS stands out for delivering frontier-level reasoning and long-form Chain-of-Thought with a smaller active-parameter footprint than its rivals. GLM-4.5, a much larger model, leads in agentic workflows and function calling but demands substantially more compute. DeepSeek-R1 and Qwen3 offer strong reasoning performance at larger scales, while Kimi K2 targets agentic and coding workflows as an instruction-tuned model.
This makes GPT-OSS a compelling proposition, striking a strong balance between reasoning performance, coding ability, and deployment efficiency. It is well suited for experimentation, integration into agentic systems, and resource-aware production workloads.