Clarifai Benchmarks GPT-OSS on NVIDIA H100s & B200s
The landscape of artificial intelligence continues its rapid evolution, with new open-weight models and hardware innovations pushing the boundaries of what’s possible. Recent insights from Clarifai highlight significant advancements, particularly in the realm of large language model (LLM) performance on cutting-edge hardware, alongside expanded tools for developers.
At the forefront of these developments are OpenAI's newly released GPT-OSS-120b and GPT-OSS-20b models, a generation of open-weight reasoning models made available under the Apache 2.0 license. Designed for robust instruction following, powerful tool integration, and advanced reasoning, these models are poised to drive the next wave of AI-driven automation. Their architecture features a Mixture of Experts (MoE) design and an extended context length of 131,072 tokens. Notably, the 120-billion-parameter model can operate efficiently on a single 80 GB GPU thanks to MXFP4 quantization of its MoE weights, balancing massive scale with practical deployment. Developers also gain flexibility: they can adjust the models' reasoning effort (low, medium, or high) to trade off speed, cost, and accuracy, and can leverage built-in capabilities such as web browsing, code execution, and custom tool integration for complex tasks.
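Because the GPT-OSS models follow OpenAI's chat conventions, the reasoning level is typically selected in the system prompt rather than through any retraining. A minimal sketch, assuming an OpenAI-compatible endpoint (the base URL, API key, and model name below are placeholders):

```python
from openai import OpenAI

# Placeholder endpoint: any OpenAI-compatible server hosting GPT-OSS
# (a vLLM, SGLang, or Clarifai deployment) is addressed the same way.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[
        # GPT-OSS reads its reasoning effort from the system prompt:
        # "low" favors speed and cost, "high" favors accuracy.
        {"role": "system", "content": "Reasoning: high"},
        {"role": "user", "content": "Plan the steps to reconcile two CSV exports."},
    ],
)
print(response.choices[0].message.content)
```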
Clarifai’s research team recently put the GPT-OSS-120b model through rigorous benchmarking on NVIDIA B200 and H100 GPUs, employing sophisticated inference frameworks such as vLLM, SGLang, and TensorRT-LLM. The tests encompassed both single-request scenarios and high-concurrency workloads, simulating environments with 50 to 100 simultaneous requests. The results underscore the transformative potential of the B200 architecture. In single-request scenarios, the B200, when paired with TensorRT-LLM, achieved a remarkable time-to-first-token (TTFT) of just 0.023 seconds, outperforming dual-H100 setups in several instances. For high-concurrency demands, the B200 demonstrated superior sustained throughput, maintaining 7,236 tokens per second at maximum load with reduced per-token latency. These findings suggest that a single B200 GPU can match or exceed the performance of two H100s, while simultaneously offering lower power consumption and simplified infrastructure. Some workloads even saw up to a 15-fold increase in inference speed compared to a single H100. While GPT-OSS models are currently deployable on H100s via Clarifai across multiple cloud environments, support for B200s is anticipated soon, promising access to NVIDIA’s latest GPU technology for both testing and production.
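TTFT itself is simple to measure: stream a response and record when the first token arrives. The sketch below illustrates the metric against any OpenAI-compatible endpoint; it is not Clarifai's benchmark harness, and the URL and model name are placeholders.

```python
import time
from openai import OpenAI

# Placeholder endpoint; vLLM, SGLang, and TensorRT-LLM servers all
# expose this same OpenAI-compatible streaming API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="YOUR_API_KEY")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Summarize the benefits of MoE models."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # time-to-first-token
        chunks += 1

elapsed = time.perf_counter() - start
print(f"TTFT: {first_token_at - start:.3f}s")
print(f"~{chunks / elapsed:.1f} chunks/s")  # streamed chunks approximate tokens
```

High-concurrency benchmarks aggregate the same per-request numbers across many requests issued in parallel, for example from a thread pool.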
Beyond hardware optimization, Clarifai is enhancing its platform for developers. The “Local Runners” feature, which allows users to run open-source models on their own hardware while still leveraging the Clarifai platform, has seen significant adoption. This capability now extends to the latest GPT-OSS models, including GPT-OSS-20b, empowering developers with full control over their compute resources for local testing and instant deployment of agentic workflows. To further facilitate this, Clarifai has introduced a new Developer Plan at a promotional price of just $1 per month. This plan expands on the existing Community Plan by enabling connection of up to five Local Runners and offering unlimited runner hours.
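Once a Local Runner is connected, the model it serves is addressable like any other Clarifai model, with requests routed through Clarifai's API but executed on the developer's own machine. A minimal sketch using the Clarifai Python SDK, with a hypothetical model URL and placeholder credentials:

```python
from clarifai.client.model import Model

# Hypothetical URL: substitute the model your Local Runner actually serves.
model = Model(
    url="https://clarifai.com/your-user-id/your-app/models/gpt-oss-20b",
    pat="YOUR_PAT",  # Clarifai personal access token
)

# The call goes through Clarifai's API but runs on your local hardware.
response = model.predict_by_bytes(
    b"Write a haiku about local inference.",
    input_type="text",
)
print(response.outputs[0].data.text.raw)
```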
Clarifai has also significantly expanded its model library, making a diverse range of open-weight and specialized models readily available for various workflows. Among the latest additions are GPT-OSS-120b, designed for strong reasoning and efficient single-GPU deployment; GPT-5, GPT-5 Mini, and GPT-5 Nano, which cater to demanding reasoning tasks, real-time applications, and ultra-low-latency edge deployments, respectively; and Qwen3-Coder-30B-A3B-Instruct, a high-efficiency coding model with robust agentic capabilities suited to code generation and development automation. These models are accessible through the Clarifai Playground or via API for integration into custom applications.
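For the API route, Clarifai documents an OpenAI-compatible endpoint, so the standard OpenAI client works. A sketch follows; the base URL reflects Clarifai's documented endpoint and the model URL is illustrative, so both should be checked against current docs:

```python
from openai import OpenAI

# Base URL per Clarifai's OpenAI-compatible API; verify against current docs.
client = OpenAI(
    base_url="https://api.clarifai.com/v2/ext/openai/v1",
    api_key="YOUR_PAT",  # Clarifai personal access token
)

completion = client.chat.completions.create(
    # Clarifai accepts the model's full URL as the identifier;
    # this Qwen3-Coder URL is illustrative.
    model="https://clarifai.com/qwen/qwenLM/models/Qwen3-Coder-30B-A3B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function that merges two sorted lists."},
    ],
)
print(completion.choices[0].message.content)
```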
Further streamlining local model deployment, Clarifai has integrated support for Ollama, a popular tool for running open-source models directly on personal machines. This integration allows Local Runners to expose locally hosted Ollama models via a secure public API, and a new Ollama toolkit within the Clarifai CLI simplifies the process of downloading, running, and exposing these models with a single command.
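Ollama already serves an OpenAI-compatible API on localhost; the integration's contribution is the managed, secure public route to it. A quick local sanity check before exposing a model (standard Ollama behavior, not Clarifai-specific):

```python
from openai import OpenAI

# Ollama's built-in OpenAI-compatible endpoint. The API key is unused
# but required by the client; the model must already be pulled locally
# (e.g. `ollama pull llama3.1`).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```

A Local Runner pointed at the same model then makes an equivalent call possible from anywhere through Clarifai's public API.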
User experience improvements have also been rolled out in the Clarifai Playground, including the ability to compare multiple models side-by-side. This feature enables developers to quickly discern differences in output, speed, and quality, facilitating optimal model selection. Enhanced inference controls, Pythonic support, and model version selectors further refine the experimentation process. Additional platform updates include improvements to the Python SDK for better logging and pipeline handling, refined token-based billing, and enhanced workflow pricing visibility, alongside improvements to Clarifai Organizations for better user management.
Through its Compute Orchestration capabilities, Clarifai is enabling the deployment of advanced models like GPT-OSS and Qwen3-Coder on dedicated GPUs, whether on-premises or in the cloud. This gives developers granular control over performance, cost, and security when serving models, Model Context Protocol (MCP) servers, or complete agentic workflows directly from their own hardware.