Nvidia urges shift to smaller, efficient LLMs for AI agents
Researchers at Nvidia are urging the artificial intelligence industry to critically re-evaluate its reliance on massive large language models (LLMs) for AI agent systems, arguing that the current trajectory is both economically and environmentally unsustainable. Instead, they propose a strategic shift towards smaller, more efficient language models, which they term “Small Language Models” (SLMs).
The financial disparity underpinning the current approach is stark. In 2024, the market for the LLM APIs that power many agent systems was valued at $5.6 billion, yet the cloud infrastructure spending required to support those systems reached an estimated $57 billion, roughly ten times as much. This operational model is deeply embedded in the industry and, as the researchers note in their recent paper, underpins substantial capital investments.
Nvidia’s team contends that SLMs, defined as models with fewer than 10 billion parameters, are often “principally sufficiently powerful,” “inherently more operationally suitable,” and “necessarily more economical” for the majority of AI agent workloads. They cite compelling examples: Microsoft’s Phi-2, a 2.7-billion-parameter model, reportedly rivals 30-billion-parameter LLMs in reasoning and code generation while running about 15 times faster. Similarly, Nvidia’s own Nemotron-H models, with up to 9 billion parameters, are reported to match the accuracy of 30-billion-parameter LLMs at a fraction of the compute. Models such as DeepSeek-R1-Distill-Qwen-7B and DeepMind’s RETRO are likewise presented as evidence that smaller systems can match or even surpass much larger proprietary models on key tasks.
The economic advantages of SLMs are particularly compelling. Running a 7-billion-parameter model can cost 10 to 30 times less than running a 70- to 175-billion-parameter LLM, a calculation that factors in latency, energy consumption, and raw compute. Fine-tuning an SLM for a specific application can be done in a few GPU hours rather than the weeks often needed for larger models, drastically shortening adaptation cycles. Many SLMs can also run locally on consumer hardware, which reduces latency and gives users greater control over their data. The researchers further note that SLMs tend to use their parameters more efficiently, whereas larger models often activate only a small fraction of their vast parameter count for any given input.

The capability argument is just as direct. AI agents, the researchers write, are essentially “heavily instructed and externally choreographed gateways to a language model” and rarely need the full breadth of what an LLM offers. Since most agent tasks are repetitive, narrowly scoped, and not conversational, specialized SLMs fine-tuned for those formats are a better fit. The recommendation is clear: build heterogeneous agent systems that default to SLMs, reserving larger models only for situations that genuinely demand complex reasoning.
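To make that architecture concrete, here is a minimal Python sketch of such a routing layer. The model calls are stubs and the escalation heuristic is an invented placeholder: the paper advocates the default-to-SLM pattern, but it does not prescribe any particular implementation, so everything below is illustrative.

```python
# Sketch of a heterogeneous agent router: default every request to a small
# local model, escalate to a large hosted model only when the request looks
# like it needs open-ended reasoning. Model backends and the escalation
# heuristic are placeholder assumptions, not taken from the Nvidia paper.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    name: str
    call: Callable[[str], str]


def call_slm(prompt: str) -> str:
    # Placeholder for a local <10B-parameter model; swap in a real client.
    return f"[SLM answer to: {prompt!r}]"


def call_llm(prompt: str) -> str:
    # Placeholder for a large hosted model, used only as a fallback.
    return f"[LLM answer to: {prompt!r}]"


def needs_llm(prompt: str) -> bool:
    # Toy heuristic: long or open-ended prompts escalate. A production
    # router would use a trained classifier or confidence signals from
    # the SLM itself rather than keyword matching.
    open_ended = any(w in prompt.lower() for w in ("why", "explain", "design"))
    return open_ended or len(prompt.split()) > 200


def route(prompt: str) -> Route:
    if needs_llm(prompt):
        return Route("llm-fallback", call_llm)
    return Route("slm-default", call_slm)


if __name__ == "__main__":
    for p in ("Extract the invoice total as JSON.",
              "Explain why the deployment failed and design a fix."):
        r = route(p)
        print(r.name, "->", r.call(p))
```

In this pattern, the narrow, repetitive calls that dominate agent traffic never leave the cheap default path, and the expensive model is paid for only when the heuristic judges it necessary.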
Despite these clear benefits, the shift to SLMs faces significant hurdles. Nvidia’s team identifies the industry’s heavy investment in centralized LLM infrastructure, its focus on broad benchmark scores, and a general lack of awareness of what smaller models can now do as the main barriers. To ease the transition, they propose a six-step plan: collect usage data, curate and filter it, cluster queries into recurring tasks, select suitable SLMs, fine-tune them for those tasks, and iterate continuously. Their case studies suggest the potential is substantial, finding that 40 to 70 percent of LLM queries in popular open-source agents like MetaGPT, Open Operator, and Cradle could be handled just as effectively by SLMs.
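The task-clustering step lends itself to a short illustration. The sketch below groups hypothetical logged agent queries using TF-IDF vectors and k-means from scikit-learn; the query log is invented and the paper does not mandate a particular clustering method, so this is only one plausible realization of the step.

```python
# Sketch of the "task clustering" step: group logged agent queries by
# similarity to surface narrow, repetitive task families that a fine-tuned
# SLM could take over. The log data below is invented for illustration.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical examples of LLM calls logged from an agent system.
logged_queries = [
    "Extract the order id and total from this email.",
    "Extract the invoice number and amount from this PDF text.",
    "Summarize this support ticket in two sentences.",
    "Summarize the following meeting notes briefly.",
    "Write a SQL query to count orders per region.",
    "Write a SQL query joining users and payments.",
]

# Vectorize the queries and partition them into candidate task families.
vectors = TfidfVectorizer().fit_transform(logged_queries)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

# Each cluster is a candidate task format for a specialized SLM.
for cluster in sorted(set(labels)):
    members = [q for q, lbl in zip(logged_queries, labels) if lbl == cluster]
    print(f"cluster {cluster}: {members}")
```

Clusters that turn out to be large and homogeneous, such as the extraction or SQL-generation queries above, are exactly the workloads where swapping in a fine-tuned SLM promises the biggest savings.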
For many, the transition to SLMs represents not just a technical refinement but also, as the researchers put it, a “Humean moral ought.” That ethical dimension gains weight amid rising operational costs and the growing environmental footprint of large-scale AI infrastructure, a concern recently underscored by Mistral’s detailed data on the energy consumption of its largest models. It might seem paradoxical for Nvidia, a major beneficiary of the LLM boom, to champion smaller models. But by advocating more accessible and efficient AI, Nvidia could expand the overall AI market, embedding the technology more deeply across businesses and consumer devices. The company is soliciting feedback from the community and plans to publish selected responses online, signalling its intent to keep this industry dialogue open.