Google's Active Learning Slashes LLM Training Data by 10,000x
Large Language Models (LLMs) show great promise for complex tasks like classifying unsafe ad content. Identifying content that violates advertising policies demands a deep understanding of context and cultural nuances, areas where LLMs often outperform traditional machine learning systems. However, fine-tuning LLMs for such intricate challenges typically requires vast quantities of high-fidelity training data, which is both difficult and expensive to acquire. This challenge is compounded by “concept drift” – the continuous evolution of safety policies and the emergence of new forms of unsafe content, often necessitating costly retraining on entirely new datasets. Consequently, minimizing the data required for training has become a critical objective.
To address this, Google Ads has developed a new, scalable process for active learning. This innovative approach drastically reduces the amount of training data needed for fine-tuning LLMs while significantly improving the model’s alignment with human experts. The process can be applied to datasets containing hundreds of billions of examples, iteratively identifying only the most valuable instances for human annotation, and then using these expert-provided labels for model fine-tuning. In experiments, this method reduced the scale of training data from 100,000 examples to fewer than 500, simultaneously boosting model-human alignment by up to 65 percent. For larger models in production, even greater reductions have been observed, using up to four orders of magnitude less data while maintaining or improving quality.
The curation process begins with an initial LLM, which, with minimal or no prior task-specific training, is given a prompt defining the content of interest – for instance, “Is this ad clickbait?” This initial LLM then labels a massive dataset of ads as either “clickbait” or “benign.” Since only a tiny fraction of production ads are truly clickbait, and the un-tuned LLM has a low true positive rate, this initial dataset is typically highly imbalanced. To pinpoint the most informative examples, the system then separately clusters the examples labeled “clickbait” and those labeled “benign.” Crucially, it identifies areas where these clusters overlap, signaling instances where the LLM is most confused or uncertain about the correct classification. From these ambiguous regions, pairs of examples closest to each other but carrying different labels are selected. If necessary to stay within the labeling budget, the system prioritizes pairs that represent a larger portion of the search space. This curated set is both highly informative, focusing on examples near the model’s decision boundary, and diverse, drawing from various parts of that boundary. These selected examples are then sent to human experts for definitive labeling.
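The post does not publish reference code, but the curation step can be sketched roughly as follows. The sketch assumes the ads have already been embedded as vectors and labeled by the prompted LLM; the choice of k-means, the region-size heuristic for the budget, and all function and variable names are illustrative assumptions, not Google’s implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances


def curate_for_review(embeddings, llm_labels, budget=100, n_clusters=20):
    """Select cross-label example pairs near the model's fuzzy decision boundary.

    embeddings : (n, d) array of precomputed ad embeddings (an assumption here).
    llm_labels : (n,) array of 0/1 labels produced by the prompted, un-tuned LLM.
    Returns up to budget // 2 (positive_index, negative_index) pairs for expert review.
    """
    pos_idx = np.where(llm_labels == 1)[0]
    neg_idx = np.where(llm_labels == 0)[0]

    # Cluster each label group separately to map where "clickbait" and
    # "benign" predictions live in embedding space.
    pos_km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings[pos_idx])
    neg_km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings[neg_idx])

    # Cluster-center pairs that sit close together approximate the overlapping
    # regions where the LLM is most confused about the correct label.
    center_dist = pairwise_distances(pos_km.cluster_centers_, neg_km.cluster_centers_)
    closest = np.argsort(center_dist, axis=None)[:n_clusters]

    candidates = []
    for p_c, n_c in zip(*np.unravel_index(closest, center_dist.shape)):
        p_members = pos_idx[pos_km.labels_ == p_c]
        n_members = neg_idx[neg_km.labels_ == n_c]
        # Within the overlapping region, take the closest pair with different labels.
        d = pairwise_distances(embeddings[p_members], embeddings[n_members])
        i, j = np.unravel_index(np.argmin(d), d.shape)
        region_size = len(p_members) + len(n_members)  # proxy for search-space coverage
        candidates.append((region_size, int(p_members[i]), int(n_members[j])))

    # If over budget, keep pairs drawn from the largest regions first, so the
    # selection stays both informative and diverse along the decision boundary.
    candidates.sort(key=lambda c: -c[0])
    return [(p, n) for _, p, n in candidates[: budget // 2]]
```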
The expert-labeled examples are then divided into two sets: one for model evaluation and one for fine-tuning the current LLM, producing the next iteration of the model. This iterative process continues until the model’s alignment with the human experts either matches the experts’ internal agreement with one another or plateaus, indicating that further iterations yield no improvement.
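Put together, the outer loop looks roughly like the sketch below. It reuses the `curate_for_review` sketch above; `label_with_experts`, `split`, `fine_tune`, and `model_human_kappa` are hypothetical helpers standing in for infrastructure the post does not describe.

```python
def active_learning_loop(model, ad_pool, expert_agreement_kappa,
                         max_rounds=10, tol=0.01):
    """Curate -> expert-label -> fine-tune until alignment with experts plateaus.

    `label_with_experts`, `split`, `fine_tune`, and `model_human_kappa` are
    hypothetical helpers, not a published API.
    """
    prev_kappa = float("-inf")
    for round_idx in range(max_rounds):
        llm_labels = model.label(ad_pool.texts)               # noisy LLM labels
        pairs = curate_for_review(ad_pool.embeddings, llm_labels)
        expert_examples = label_with_experts(pairs)           # definitive human labels

        # Split the expert-labeled examples into an evaluation set and a
        # fine-tuning set that produces the next model iteration.
        eval_set, train_set = split(expert_examples, eval_fraction=0.4)
        model = fine_tune(model, train_set)

        kappa = model_human_kappa(model, eval_set)
        print(f"round {round_idx}: Cohen's Kappa vs. experts = {kappa:.2f}")

        # Stop once the model agrees with the experts about as well as the
        # experts agree with one another, or once the improvement plateaus.
        if kappa >= expert_agreement_kappa or kappa - prev_kappa < tol:
            return model
        prev_kappa = kappa
    return model
```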
For classification problems in ads safety, such as content moderation or fraud detection, there is often no single “ground truth” due to inherent ambiguity requiring expert interpretation. Therefore, standard metrics like precision and recall, which depend on a definitive ground truth, are unsuitable. Instead, Google’s researchers employ Cohen’s Kappa, a statistical measure that quantifies the level of agreement between two independent annotators or, in this case, between the model and human experts, beyond what might occur by random chance. A Kappa score closer to 1 indicates strong agreement, while 0 suggests agreement no better than chance. Scores above 0.8 are generally considered exceptionally good, and values above 0.4 are deemed acceptable.
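As a concrete illustration of the metric (not the researchers’ code), Cohen’s Kappa can be computed directly from paired model and expert labels; scikit-learn’s `cohen_kappa_score` implements the same formula. The toy labels below are made up for the example.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Toy example: model and expert labels for 10 ads (1 = clickbait, 0 = benign).
model_labels  = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])
expert_labels = np.array([1, 0, 0, 0, 1, 0, 1, 0, 1, 0])

# Observed agreement: fraction of ads where model and expert give the same label.
p_o = np.mean(model_labels == expert_labels)

# Chance agreement: probability both say "clickbait" plus probability both say
# "benign", if each labeled independently at their own base rates.
p_e = (model_labels.mean() * expert_labels.mean()
       + (1 - model_labels.mean()) * (1 - expert_labels.mean()))

kappa = (p_o - p_e) / (1 - p_e)
print(kappa)                                            # manual computation, ~0.58
print(cohen_kappa_score(model_labels, expert_labels))   # same value via scikit-learn
```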
To evaluate the new curation process, experiments were conducted using two Gemini Nano LLMs of different sizes (1.8 billion and 3.25 billion parameters) on two ad safety tasks of varying complexity. For baseline comparisons, these models were fine-tuned using approximately 100,000 crowdsourced annotations, which typically had a significant class imbalance (around 95 percent benign labels). In the curated conditions, the same models were fine-tuned over multiple rounds using the new active learning process. The models plateaued after 5 to 6 iterations, requiring only about 250 to 450 expert-labeled fine-tuning examples and 150 to 250 evaluation samples in total.
The results demonstrated a clear advantage for the curated approach, especially with the larger model. While the 1.8 billion parameter model performed roughly the same under both the baseline and curated conditions (Kappa scores around 0.24-0.25, lower than the larger model), the 3.25 billion parameter model saw substantial quality improvements with the new curation process. For the lower complexity task, its Kappa score jumped from 0.36 (baseline) to 0.56 (curated); for the higher complexity task, it improved from 0.23 to 0.38. This represents a 55-65 percent improvement in alignment with human experts, achieved with three orders of magnitude less data: a few hundred examples compared to 100,000 in the baseline.
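The quoted 55-65 percent range follows directly from those Kappa scores; the quick arithmetic check below simply restates them as relative improvements of the curated runs over their baselines.

```python
# Relative improvement of the curated 3.25B model over its baseline Kappa scores.
lower_complexity  = (0.56 - 0.36) / 0.36   # ~0.56, i.e. a ~56% relative gain
higher_complexity = (0.38 - 0.23) / 0.23   # ~0.65, i.e. a ~65% relative gain
```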
These findings underscore that carefully curating LLM training sets down to fewer, more informative examples can yield equivalent or better classifier performance from far less data. While the experiments showed a three-order-of-magnitude reduction, production systems with even larger models have achieved reductions of up to four orders of magnitude. Such gains, however, hinge on extremely high-quality human annotations; pairwise Cohen’s Kappa above 0.8 among annotators was found to be necessary to reliably outperform crowdsourced data. By combining the LLMs’ ability to broadly survey a problem space with human experts’ precision on challenging examples, this curation process offers a flexible and efficient way around the data bottleneck, which is particularly valuable in rapidly evolving domains like ads safety.