Open-Source AI Models Burn More Compute Than Closed Counterparts
A comprehensive new study challenges the prevailing belief that open-source artificial intelligence models offer clear economic advantages over their proprietary counterparts. Research conducted by AI firm Nous Research indicates that open-source models consume substantially more computing resources to perform identical tasks, potentially eroding their perceived cost benefits and prompting a re-evaluation of enterprise AI deployment strategies.
The study, which analyzed 19 different AI models across a spectrum of tasks including basic knowledge questions, mathematical problems, and logic puzzles, found that open-weight models use between 1.5 and 4 times more tokens—the fundamental units of AI computation—than closed models from developers like OpenAI and Anthropic. The disparity was particularly stark for simple knowledge queries, where some open models consumed up to 10 times more tokens. Researchers noted in their report that while open-source models typically boast lower per-token running costs, this advantage can be “easily offset if they require more tokens to reason about a given problem,” making them potentially more expensive per query.
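The arithmetic behind that offset is simple: what a query actually costs is tokens consumed times price per token, not price per token alone. The sketch below uses invented prices and token counts (not figures from the study) to show how a model with a lower per-token rate can still be the more expensive one at 4x the token usage.

```python
# Hypothetical per-token prices and token counts -- illustrative only,
# not the study's actual numbers.
def query_cost(tokens_used: int, price_per_million: float) -> float:
    """Total cost of one query in dollars: tokens x per-token price."""
    return tokens_used * price_per_million / 1_000_000

# Closed model: pricier per token, but frugal with tokens.
closed_cost = query_cost(tokens_used=300, price_per_million=10.0)
# Open model: a third of the per-token price, but 4x the tokens.
open_cost = query_cost(tokens_used=1_200, price_per_million=3.0)

print(f"closed: ${closed_cost:.4f}  open: ${open_cost:.4f}")
# 300 * $10/M = $0.0030, while 1200 * $3/M = $0.0036 -- the "cheaper" model loses.
```

The same comparison flips back in the open model's favor whenever its token multiplier stays below the price ratio, which is why the study frames efficiency and pricing as inseparable.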
A key metric examined was “token efficiency,” which measures how many computational units models use relative to the complexity of their solutions. This metric, despite its profound cost implications, has received little systematic study until now. The inefficiency is especially pronounced in Large Reasoning Models (LRMs), which employ extended “chains of thought”—step-by-step reasoning processes—to tackle complex problems. These models can, surprisingly, expend hundreds or even thousands of tokens pondering simple questions that should require minimal computation, such as “What is the capital of Australia?”
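One simple way to operationalize token efficiency is tokens spent per correctly solved problem; the study's formal definition may differ, and the numbers below are assumptions chosen only to illustrate the comparison.

```python
# Minimal sketch of a token-efficiency comparison. Model names, token
# totals, and solve counts are invented for illustration.
def token_efficiency(total_tokens: int, problems_solved: int) -> float:
    """Average completion tokens spent per correctly solved problem (lower is better)."""
    return total_tokens / problems_solved

benchmark_runs = {
    "closed-model": (30_000, 95),  # fewer tokens at comparable accuracy
    "open-model": (90_000, 93),    # verbose chains of thought inflate the total
}
for name, (tokens, solved) in benchmark_runs.items():
    print(f"{name}: {token_efficiency(tokens, solved):.1f} tokens per solve")
```

Under this toy definition the open model spends roughly three times as many tokens per solved problem despite nearly identical accuracy, which is precisely the gap accuracy benchmarks alone would miss.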
The research revealed striking differences in efficiency among model providers. OpenAI’s models, including its o4-mini and newly released open-source gpt-oss variants, demonstrated exceptional token efficiency, particularly for mathematical problems, using up to three times fewer tokens than other commercial models. Among the open-source options, Nvidia’s llama-3.3-nemotron-super-49b-v1 emerged as the most token-efficient model across all domains, whereas newer models such as Mistral’s Magistral stood out as outliers with exceptionally high token usage. While open models used roughly twice as many tokens for mathematical and logic problems, the gap widened dramatically for simple knowledge questions where extensive reasoning should be unnecessary.
These findings carry immediate and significant implications for enterprise AI adoption, where computing costs can escalate rapidly with usage. Companies evaluating AI models often prioritize accuracy benchmarks and per-token pricing, frequently overlooking the total computational requirements for real-world tasks. The study concluded that “the better token efficiency of closed weight models often compensates for the higher API pricing of those models” when analyzing total inference costs. This suggests that proprietary model providers have actively optimized their offerings for efficiency, iteratively reducing token usage to lower inference costs. Conversely, some open-source models have shown increased token usage in newer versions, possibly reflecting a prioritization of better reasoning performance over computational frugality.
Measuring efficiency across diverse model architectures presented unique challenges, particularly since many closed-source models do not disclose their raw reasoning processes. To circumvent this, researchers used completion tokens—the total computational units billed for each query—as a proxy for reasoning effort. They discovered that most recent closed-source models provide compressed summaries of their internal computations, often using smaller language models to transcribe complex chains of thought, thereby protecting their proprietary techniques. The study’s methodology also included testing with modified versions of well-known problems, such as altering variables in mathematical competition problems, to minimize the influence of memorized solutions.
Looking ahead, the researchers advocate for token efficiency to become a primary optimization target alongside accuracy in future model development. They suggest that a more “densified CoT” will allow for more efficient context usage and could counteract context degradation during challenging reasoning tasks. The advent of OpenAI’s open-source gpt-oss models, which combine state-of-the-art efficiency with freely accessible chains of thought, could serve as a crucial reference point for optimizing other open-source models. As the AI industry races towards more powerful reasoning capabilities, this study underscores that the true competition may not simply be about who builds the smartest AI, but who can build the most efficient one. After all, in an ecosystem where every token counts, the most wasteful models, regardless of their intellectual prowess, may ultimately find themselves priced out of the market.