Tencent's Hunyuan-Large-Vision: China's Top Multimodal AI Model


Tencent has unveiled Hunyuan-Large-Vision, a new multimodal artificial intelligence model that has quickly established itself as a frontrunner in China’s competitive AI landscape. The model now leads all Chinese entries on the LMArena Vision Leaderboard, positioning itself directly behind top-tier Western models such as GPT-5 and Gemini 2.5 Pro.

Built on a mixture-of-experts architecture, Hunyuan-Large-Vision has 389 billion parameters in total, of which roughly 52 billion are active for any given forward pass. This design lets the model route each input to only the most relevant expert sub-networks, improving efficiency without sacrificing capacity. Its capabilities are reportedly comparable to those of Claude 3.5 Sonnet, a leading model in its own right. On the OpenCompass Academic Benchmark, Tencent reports that Hunyuan-Large-Vision achieved an average score of 79.5.
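To make the routing idea concrete, here is a minimal, hedged sketch of a top-k mixture-of-experts layer in PyTorch. The expert count, layer sizes, and k value are invented for illustration and say nothing about Tencent's actual configuration; the point is only that each token activates a small subset of experts while the rest stay idle.

```python
# Minimal top-k mixture-of-experts layer (illustrative only; not Tencent's code).
# Sizes and expert count are made-up values: each token is routed to its top-k
# experts, so only a fraction of the total parameters runs per forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # scores each token against every expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)     # keep only the k best experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens that picked expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 64)
print(TopKMoE()(tokens).shape)   # torch.Size([4, 64]); only 2 of 8 experts ran per token
```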

The new model has overtaken Alibaba's Qwen2.5-VL as the top-rated Chinese contender on the LMArena Vision Leaderboard, which ranks vision-language models by community preferences in head-to-head comparisons. Hunyuan-Large-Vision performs strongly across a wide array of visual and language tasks, though the Western models it is compared against on the leaderboard may not always be the very latest releases.

Tencent showcased the model’s versatility through a diverse range of applications. It can accurately identify specific plant species, such as Iris lactea, and even compose poetry inspired by a photograph of the Seine River. Beyond creative endeavors, it offers strategic advice in complex games like Go and demonstrates proficiency in translating questions into various languages, including less common ones, a significant improvement over Tencent’s earlier vision models.

At its core, Hunyuan-Large-Vision integrates three primary modules: a custom vision transformer with roughly a billion parameters for processing visual input, a connector module that bridges vision and language understanding, and a language model built on the mixture-of-experts technique. The vision transformer was first trained to align images with text, then refined on more than a trillion multimodal samples. According to Tencent, this training lets it outperform other popular models on complex multimodal tasks.
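The article does not include code, but the three-module layout can be sketched schematically. Everything below (class names, dimensions, and a toy transformer standing in for the MoE language model) is an assumption made for illustration: the vision encoder turns an image into patch tokens, the connector projects them into the language model's embedding space, and the language model then processes image and text tokens as one sequence.

```python
# Schematic vision-language stack: vision encoder -> connector -> language model.
# All names and sizes are illustrative assumptions, not Tencent's released design.
import torch
import torch.nn as nn

class ToyVisionEncoder(nn.Module):
    """Stand-in for the vision transformer: turns an image into patch embeddings."""
    def __init__(self, patch=16, d_vision=128):
        super().__init__()
        self.to_patches = nn.Conv2d(3, d_vision, kernel_size=patch, stride=patch)

    def forward(self, images):                          # (batch, 3, H, W)
        feats = self.to_patches(images)                  # (batch, d_vision, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)          # (batch, n_patches, d_vision)

class Connector(nn.Module):
    """Projects visual features into the language model's embedding space."""
    def __init__(self, d_vision=128, d_model=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(d_vision, d_model), nn.GELU(),
                                  nn.Linear(d_model, d_model))

    def forward(self, vision_tokens):
        return self.proj(vision_tokens)

class ToyLanguageModel(nn.Module):
    """Stand-in for the MoE language model: a small transformer over the joint sequence."""
    def __init__(self, vocab=1000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, image_tokens, text_ids):
        text_tokens = self.embed(text_ids)                        # (batch, text_len, d_model)
        joint = torch.cat([image_tokens, text_tokens], dim=1)     # image tokens prepended to text
        return self.lm_head(self.blocks(joint))

vision, connector, lm = ToyVisionEncoder(), Connector(), ToyLanguageModel()
images = torch.randn(1, 3, 224, 224)
text_ids = torch.randint(0, 1000, (1, 8))
logits = lm(connector(vision(images)), text_ids)
print(logits.shape)   # (1, 196 + 8, 1000): one prediction per image patch and text token
```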

Tencent also built a new training pipeline for multimodal data. The system converts vast quantities of noisy raw data into high-quality instruction data using pre-trained models and specialized tools, yielding a dataset of more than 400 billion multimodal samples spanning visual recognition, mathematics, scientific reasoning, and optical character recognition (OCR). The model was further refined with rejection sampling, in which multiple responses are generated for a given prompt and only the best are retained. Automated tools filtered out errors and redundancies, and complex answers were condensed into more concise forms to improve reasoning efficiency.
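Rejection sampling itself follows a simple pattern that can be sketched in a few lines. The generate and score functions below are hypothetical stand-ins (the article does not describe Tencent's scorer); the sketch only shows the general recipe: draw several candidate responses per prompt and keep the best one if it clears a quality bar.

```python
# Illustrative rejection-sampling loop for curating instruction data.
# `generate` and `score` are hypothetical placeholders for a model call and a
# quality check; the article does not specify Tencent's actual components.
import random
from typing import Callable, Optional

def rejection_sample(prompt: str,
                     generate: Callable[[str], str],
                     score: Callable[[str, str], float],
                     n_candidates: int = 8,
                     min_score: float = 0.5) -> Optional[str]:
    """Sample n_candidates responses, keep the best one if it clears the threshold."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    best = max(candidates, key=lambda resp: score(prompt, resp))
    return best if score(prompt, best) >= min_score else None   # drop prompts with no good answer

# Toy stand-ins so the sketch runs end to end.
def toy_generate(prompt: str) -> str:
    return f"{prompt} -> answer #{random.randint(0, 99)}"

def toy_score(prompt: str, response: str) -> float:
    return random.random()   # a real pipeline would use a verifier or reward model

kept = rejection_sample("Describe the chart in the image.", toy_generate, toy_score)
print(kept)
```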

The training process itself benefited from Tencent's Angel-PTM framework and a multi-level load-balancing strategy, which the company says reduced GPU bottlenecks by 18.8 percent and shortened the overall training timeline.

Currently, Hunyuan-Large-Vision is exclusively available via API on Tencent Cloud. Unlike some of Tencent’s previous AI models, this version is not open source. Given its substantial 389 billion parameters, running the model on typical consumer hardware would be impractical, underscoring its design for large-scale cloud-based applications.