MoA: Multi-Agent LLM Collaboration for SOTA Performance
The Mixture-of-Agents (MoA) framework offers a way for large language models (LLMs) to reach higher levels of accuracy, reasoning depth, and reliability. Rather than relying on a single, monolithic LLM, MoA orchestrates a team of models that collaborate in structured layers, refining outputs step by step. This approach has already yielded state-of-the-art results using only open-source models, surpassing top proprietary LLMs like GPT-4 Omni on multiple benchmarks. Crucially, it achieves this without the prohibitive cost typically associated with scaling a single massive model.
The foundational insight behind MoA stems from a surprising discovery: LLMs exhibit an inherent ability to collaborate. Experiments on the AlpacaEval 2.0 benchmark revealed that various off-the-shelf LLMs, including LLaMA, WizardLM, and Qwen, significantly improved their performance (measured by their “win rate” against a GPT-4 reference) when provided with answers from peer models in addition to the original prompt. This improvement occurred even when the peer answers were inferior to what the model might have produced on its own, suggesting that multiple perspectives help an LLM identify and avoid blind spots. This evidence of intrinsic “collaborativeness” prompted the design of MoA, a framework designed to harness the collective expertise of diverse models.
MoA addresses the challenge of achieving high-quality LLM outputs efficiently through a structured, multi-agent architecture. Its design features multiple layers, with several agents operating within each layer. Every agent after the first layer receives the outputs of the preceding layer's agents as additional input, enabling a process of iterative improvement. Agents are assigned one of two specialized roles: "Proposers" generate diverse candidate answers, contributing valuable context and varied perspectives. "Aggregators," by contrast, specialize in synthesizing and refining these inputs into a single, higher-quality response, maintaining or even enhancing quality even if some initial inputs are weak. Many models, such as GPT-4, Qwen-1.5, and LLaMA, have demonstrated strong performance in both roles, while others, like WizardLM, excel more as proposers. MoA leverages these strengths by assigning models to the roles where they perform best, all through prompt engineering alone, with no fine-tuning required.
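Because the aggregator role is implemented purely through prompting, it can be sketched as a prompt-construction step. The sketch below paraphrases the spirit of the aggregate-and-synthesize instruction; the function name and exact wording are illustrative, not a published API.

```python
# Sketch of the aggregator role as plain prompt engineering (no fine-tuning).
# The instruction text paraphrases the aggregate-and-synthesize idea; the
# function name and wording here are illustrative assumptions.

def build_aggregator_prompt(user_prompt: str, peer_answers: list[str]) -> str:
    """Pack the original query plus all peer proposals into one prompt that
    asks the aggregator model to synthesize a single refined answer."""
    numbered = "\n\n".join(
        f"[Response {i + 1}]\n{ans}" for i, ans in enumerate(peer_answers)
    )
    return (
        "You have been provided with responses from several models to the "
        "user query below. Synthesize them into a single, high-quality "
        "answer: critically evaluate each response, as some may be biased "
        "or incorrect, and do not simply copy any one of them.\n\n"
        f"User query:\n{user_prompt}\n\n{numbered}"
    )
```

The numbering of peer responses makes it easy for the aggregator to weigh proposals against each other rather than treating them as one undifferentiated blob.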
In practice, MoA organizes these agents into a pipeline of layers. For instance, in an architecture with four layers, the first layer's proposer agents independently generate initial answers to a user's prompt. Their outputs are then passed to the subsequent layer, where another set of agents—which can be the same models or different ones—access all prior answers as additional context. This iterative refinement continues across layers, allowing each successive layer's agents to work with progressively more comprehensive and robust material. The final layer typically features an aggregator agent that produces the single, consolidated answer, which is far stronger than any individual initial attempt.
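The pipeline described above can be expressed as a short loop. In this sketch, a "model" is abstracted as any function from prompt to answer; in a real deployment each would wrap an LLM API call. The structure (thread prior answers into each layer, then hand everything to a final aggregator) follows the description above, while the prompt wording is an illustrative assumption.

```python
# Minimal sketch of a layered MoA pipeline. A Model is abstracted as a
# plain prompt -> answer function; in practice each would wrap an LLM call.
from typing import Callable

Model = Callable[[str], str]

def run_moa(user_prompt: str,
            proposer_layers: list[list[Model]],
            aggregator: Model) -> str:
    """Run each proposer layer with the previous layer's answers as extra
    context, then let a single aggregator produce the consolidated answer."""
    answers: list[str] = []
    for layer in proposer_layers:
        if answers:
            context = "\n\n".join(f"[{i + 1}] {a}" for i, a in enumerate(answers))
            prompt = f"{user_prompt}\n\nPrevious answers:\n{context}"
        else:
            prompt = user_prompt  # first layer sees only the user prompt
        answers = [model(prompt) for model in layer]
    final_prompt = ("Synthesize the responses below into one refined answer.\n\n"
                    f"Query: {user_prompt}\n\n" + "\n\n".join(answers))
    return aggregator(final_prompt)
```

In a production setting the list comprehension over each layer would typically be replaced by concurrent API calls, since proposers within a layer are independent of one another.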
A key strategic decision in MoA is how to assign models to layers. The framework suggests two primary criteria: performance, where stronger models are ideal candidates for later layers, and diversity, emphasizing a mix of model types as heterogeneous models contribute significantly more than identical clones. In many implementations, the final layer employs the strongest available model as the aggregator, while earlier layers are populated with a diverse set of proposers. For example, a powerful open-source model akin to GPT-4 might serve as the final aggregator, synthesizing proposals from specialized smaller models—perhaps a code-focused LLM, a reasoning-focused LLM, or a factual knowledge LLM—depending on the query domain.
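A concrete assignment following these two criteria might look like the configuration below. The model identifiers are examples drawn from the open models mentioned in this article, and the two-proposer-layer shape is one illustrative choice, not a prescribed recipe.

```python
# Illustrative layer assignment following the two criteria above:
# heterogeneous proposers in the early layers, the strongest available
# model as the final aggregator. Model names are examples only.

moa_config = {
    "proposer_layers": [
        # Layer 1: diverse model families for varied perspectives.
        ["llama-3-70b-instruct", "wizardlm-2-8x22b", "qwen1.5-110b-chat"],
        # Layer 2: the same (or different) models, now with richer context.
        ["llama-3-70b-instruct", "wizardlm-2-8x22b", "qwen1.5-110b-chat"],
    ],
    # Final layer: the strongest model synthesizes all proposals.
    "aggregator": "qwen1.5-110b-chat",
}
```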
The performance of the MoA architecture on rigorous benchmarks has been striking. Using only open-source models, MoA has consistently matched or surpassed the quality of GPT-4. On AlpacaEval 2.0, an open-source MoA configuration achieved a 65.1% win rate, outperforming GPT-4 Omni’s 57.5% and GPT-4 Turbo’s 55.0%. Similarly, on MT-Bench, the open-source MoA scored 9.25, comparable to GPT-4 Turbo’s 9.31 and GPT-4 Omni’s 9.19. Furthermore, fine-grained evaluations using the FLASK framework showed MoA outperforming GPT-4 Omni across critical skill dimensions such as robustness, correctness, factuality, insightfulness, and completeness. These gains were achieved with open models that, collectively, are far more cost-effective than proprietary alternatives. For instance, one MoA setup using six open models across three layers cost only a fraction of GPT-4’s API usage. A lighter variant, MoA-Lite, using just two layers and a smaller aggregator, still slightly beat GPT-4 Omni on AlpacaEval while being even more cost-effective, demonstrating that even a pared-down MoA can deliver superior quality at lower costs.
The effectiveness of MoA lies in its ability to tap into the “wisdom of crowds” among models. Each agent contributes unique strengths—one might provide specific knowledge, another ensures logical consistency, and yet another refines phrasing. The final result benefits from this collective expertise. This goes beyond simple ensemble methods where an LLM merely picks the best answer from multiple options; MoA’s aggregators genuinely synthesize ideas, combining the strongest elements from various proposals.
For developers, MoA offers significant cost-effectiveness and flexibility. By orchestrating smaller open models, it allows for GPT-4-level output without incurring high API fees or the computational burden of running a single, massive model for every query. MoA configurations consistently lie on a favorable quality-cost curve, delivering high win rates at substantially lower costs than GPT-4. For example, some MoA configurations achieved a 4% higher win rate than GPT-4 Turbo at half the inference cost. The framework’s flexibility allows dynamic scaling of agents or layers based on query complexity or available compute, enabling developers to mix and match open models to specialize agents for particular tasks.
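Dynamic scaling can be as simple as routing queries to a shallower or deeper configuration based on an estimate of their complexity. The heuristic below (word count) and its thresholds are purely illustrative assumptions; the only grounded element is the idea that a two-layer MoA-Lite setup suffices for simpler queries while harder ones justify a deeper pipeline.

```python
# Hedged sketch of depth selection for dynamic scaling. The word-count
# heuristic and the thresholds are illustrative assumptions; a real router
# might use a classifier or the task domain instead.

def choose_depth(query: str, max_layers: int = 4) -> int:
    """Pick how many MoA layers to run for a given query."""
    words = len(query.split())
    if words < 20:
        return 2          # MoA-Lite style: two layers for simple queries
    if words < 100:
        return 3          # mid-depth pipeline
    return max_layers     # full pipeline for long, complex prompts
```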
Looking ahead, the Mixture-of-Agents framework signals a fundamental shift in AI system design. It moves beyond reliance on single, monolithic models towards creating collaborative teams of specialized LLMs, mirroring how human expert teams operate. These multi-agent ecosystems promise greater robustness and transparency, as each agent’s contribution can be tracked, enhancing trust in the final output. As open-source LLMs continue to advance, MoA-style architectures are poised to become a standard approach for production-grade LLM deployments, scaling quality through sophisticated collaboration rather than sheer model size.