Meta CLIP 2: First Worldwide Multilingual CLIP Model from Scratch

Marktechpost

Contrastive Language-Image Pre-training (CLIP) has emerged as a foundational technology for modern computer vision and multimodal AI models, powering capabilities like zero-shot image classification and serving as a crucial vision component within multimodal large language models (MLLMs). However, the widespread adoption of CLIP has encountered a significant limitation: most variants, including Meta CLIP, have historically relied on English-only datasets for their training. This leaves untapped a vast wealth of non-English content across the global web and creates a bottleneck for truly universal AI applications.

The challenge of expanding CLIP beyond English is twofold. Firstly, there’s a notable absence of efficient methods for curating high-quality, non-English data at the immense scale required for such models. Secondly, integrating multilingual data often leads to a phenomenon dubbed the “curse of multilinguality,” where adding non-English content paradoxically degrades performance on English-language tasks. These intertwined issues have severely hampered the development of unified AI models capable of excelling across both English and non-English linguistic environments.

Previous attempts to address these limitations have faced their own hurdles. Models like OpenAI CLIP and the original Meta CLIP were inherently tied to English-centric data curation. Distillation-based approaches, which transfer knowledge from a larger “teacher” model, often introduce biases from these external sources. While SigLIP and SigLIP 2 explored using data from Google Image Search, their reliance on proprietary sources restricts scalability. Other multilingual CLIP models, such as M-CLIP and mCLIP, adopted distillation, using an English-only CLIP as a visual encoder and training multilingual text encoders with lower-quality data. Hybrid methods like SLIP and LiT combined language supervision with self-supervised learning, aiming for a balance between semantic understanding and visual representation. Yet, despite these varied efforts, none fully resolved the core dilemma of scaling CLIP globally without performance trade-offs.

A collaborative research effort from Meta, MIT, Princeton University, and New York University has now introduced Meta CLIP 2, marking a significant leap forward. This new method is the first to train CLIP models from scratch using native worldwide image-text pairs, entirely bypassing external resources such as private datasets, machine translation, or distillation. Meta CLIP 2 aims to eliminate the performance trade-offs between English and non-English data by meticulously designing and jointly scaling its metadata, data curation processes, model capacity, and training methodologies. Critically, it maximizes compatibility with OpenAI CLIP’s architecture, ensuring broad applicability to existing CLIP models and their variants.

The innovation behind Meta CLIP 2’s global scalability rests on three pillars: scalable metadata covering more than 300 languages, a per-language curation algorithm that ensures a balanced distribution of concepts, and a worldwide training framework. To address data availability, the researchers curate native image-text pairs in each language directly from the web rather than relying on translation, distillation, or proprietary sources. To address the “curse of multilinguality,” the training framework largely mirrors OpenAI CLIP and Meta CLIP’s established settings and model architecture, but adds a multilingual text tokenizer, a strategy for scaling the number of “seen” training pairs, and an analysis of the minimum model capacity needed to absorb worldwide data without sacrificing English performance.
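The per-language balancing idea can be pictured with a short sketch. The snippet below is illustrative only: the metadata entries, the `match_entries` and `curate` helpers, and the threshold values are assumptions for exposition rather than the released pipeline, but they capture the core step of matching alt-texts against per-language metadata and capping over-represented “head” concepts at a per-language threshold while keeping rare “tail” concepts in full.

```python
import random
from collections import defaultdict

# Illustrative per-language concept metadata (the real metadata spans 300+
# languages; these entries are placeholders, not the actual lists).
METADATA = {
    "en": {"cat", "mountain", "bicycle"},
    "fr": {"chat", "montagne", "vélo"},
}

def match_entries(alt_text, lang):
    """Return the metadata entries that occur as substrings of the alt-text."""
    text = alt_text.lower()
    return [entry for entry in METADATA.get(lang, ()) if entry in text]

def curate(pairs, t_per_lang, seed=0):
    """Balance (image_url, alt_text, lang) pairs per language.

    t_per_lang maps a language code to a threshold t: any metadata entry
    matched by more than t pairs (a "head" concept) is down-sampled to t,
    while rarer "tail" concepts keep every matched pair.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)              # (lang, entry) -> matched pairs
    for pair in pairs:
        _, alt_text, lang = pair
        for entry in match_entries(alt_text, lang):
            buckets[(lang, entry)].append(pair)

    curated = []
    for (lang, entry), matched in buckets.items():
        t = t_per_lang.get(lang, 20_000)     # default threshold is an assumption
        curated.extend(rng.sample(matched, t) if len(matched) > t else matched)
    return curated
```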

To ensure generalizability, the training setup incorporated OpenAI CLIP’s ViT-L/14 and Meta CLIP’s ViT-H/14 models, modified for multilingual support. Studies on model expressivity revealed that even OpenAI’s ViT-L/14 struggled with the “curse” due to its limited capacity when faced with global data. In contrast, the larger ViT-H/14 model proved to be an inflection point, achieving notable performance gains in both English and non-English tasks.
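Because the recipe stays close to the original CLIP formulation, the multilingual change on the text side is largely confined to the tokenizer. The PyTorch sketch below shows the standard symmetric contrastive objective that such a framework retains; the choice of an XLM-R tokenizer and the 1024-dimensional embeddings are illustrative assumptions, not necessarily Meta CLIP 2’s exact components.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer

# A multilingual tokenizer in place of an English-only BPE vocabulary.
# XLM-R is used here purely as an example of a multilingual tokenizer.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
captions = ["a photo of a cat", "un vélo rouge", "一只在雪地里的狗"]
text_inputs = tokenizer(captions, padding=True, truncation=True,
                        max_length=77, return_tensors="pt")  # fed to the text tower

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric CLIP objective over a batch of N aligned image-text pairs."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits_per_image = logit_scale * image_features @ text_features.t()
    targets = torch.arange(image_features.size(0), device=image_features.device)
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_image.t(), targets)
    return (loss_images + loss_texts) / 2

# Toy usage with random tensors standing in for ViT-H/14 image embeddings
# and text-tower outputs (embedding width 1024 is an assumption).
image_emb = torch.randn(3, 1024)
text_emb = torch.randn(3, 1024)
loss = clip_contrastive_loss(image_emb, text_emb, logit_scale=torch.tensor(100.0))
```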

When trained on the ViT-H/14 model with worldwide data and scaled seen pairs, Meta CLIP 2 outperformed its English-only counterparts by 1.0x and its non-English counterparts by 1.3x across both English and multilingual tasks. The “curse” persisted, however, when seen-pair scaling was not applied or when smaller models such as ViT-L/14 were used. The transition from English-centric metadata to worldwide metadata also proved essential. For instance, simply removing the English filter on alt-texts (the descriptive text accompanying web images) led to a slight 0.6% drop in ImageNet accuracy, underscoring the impact of language isolation. Conversely, replacing English metadata with merged worldwide metadata initially lowered English performance but significantly boosted multilingual capabilities. Evaluations on zero-shot classification and few-shot geo-localization benchmarks consistently improved when scaling from 13 billion English pairs to 29 billion worldwide pairs, with the exception of performance saturation on the GeoDE benchmark.
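For context, the zero-shot classification protocol such benchmarks rely on is simple: embed one prompt per class, embed the image, and pick the most similar class. The sketch below assumes a generic `encode_text` callable standing in for the model’s text tower; in a multilingual evaluation the prompt templates and class names would be rendered in each benchmark’s language.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_features, class_prompts, encode_text):
    """Predict a class index per image by cosine similarity to class prompts.

    image_features: (N, D) image embeddings from the vision tower
    class_prompts:  one prompt string per class, e.g. "a photo of a {label}",
                    rendered in the evaluation language
    encode_text:    callable mapping a list of strings to (C, D) text embeddings
                    (a placeholder for the CLIP text tower)
    """
    text_features = F.normalize(encode_text(class_prompts), dim=-1)
    image_features = F.normalize(image_features, dim=-1)
    similarities = image_features @ text_features.t()   # (N, C) cosine similarities
    return similarities.argmax(dim=-1)

# Toy usage with random embeddings in place of real model outputs.
preds = zero_shot_classify(torch.randn(4, 512),
                           ["a photo of a cat", "a photo of a dog"],
                           lambda prompts: torch.randn(len(prompts), 512))
```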

In essence, Meta CLIP 2 represents a paradigm shift. It is the first CLIP model trained from scratch on a truly global scale using native image-text pairs. Its success demonstrates that by strategically scaling metadata, curation, and training capacity, the long-standing “curse of multilinguality” can be broken, leading to mutual benefits for both English and non-English language performance. The ViT-H/14 variant of Meta CLIP 2, for example, surpasses its English-only counterpart on zero-shot ImageNet (improving from 80.5% to 81.3%) and achieves outstanding results on multilingual benchmarks such as XM3600, Babel-IN, and CVQA, all within a single, unified model. By open-sourcing its metadata, curation methods, and training code, Meta CLIP 2 empowers the global research community to move decisively beyond English-centric approaches, unlocking the full potential of the worldwide multimodal web.