Google's CTCL: Lightweight AI for Private Data Synthesis
Generating large-scale, privacy-preserving synthetic data is a significant challenge in artificial intelligence because of an inherent three-way trade-off among privacy guarantees, computational cost, and the utility of the generated data. Achieving strong privacy typically means either compromising data quality or paying a substantial computational price. A common approach privately fine-tunes massive, billion-parameter large language models (LLMs) on sensitive “private data” (the dataset intended for privacy protection) and then samples from the adapted models. However, this method is computationally intensive and impractical for many resource-constrained applications. Recent algorithms like Aug-PE and Pre-Text attempt to bypass fine-tuning by relying on LLM API access, yet they frequently depend on extensive manual prompting and struggle to effectively leverage private information during iterative data selection.
Addressing these limitations, researchers at Google have developed CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for creating privacy-preserving synthetic data. Presented at ICML 2025, CTCL eliminates the need for fine-tuning billion-scale LLMs or engaging in domain-specific prompt engineering. Instead, it utilizes a lightweight 140-million-parameter model, making it a viable solution for resource-constrained environments. By incorporating topic information, CTCL ensures that the generated synthetic data accurately reflects the topic distribution of the original private domain. Crucially, unlike algorithms such as Aug-PE, CTCL can generate an unlimited number of synthetic data samples without incurring additional privacy costs, leveraging a fundamental property of differential privacy. Extensive evaluations across diverse datasets have shown that CTCL consistently outperforms baseline methods, particularly when strong privacy guarantees are required. Furthermore, ablation studies have underscored the vital roles of its pre-training and keyword-based conditioning in achieving these results, alongside demonstrating CTCL’s improved scalability compared to Aug-PE.
The CTCL framework is meticulously designed to produce high-quality synthetic data from private datasets while rigorously maintaining privacy. Its operation unfolds in three primary stages, built upon two core components developed once using extensive public corpora: CTCL-Topic and CTCL-Generator. CTCL-Topic serves as a universal topic model, identifying high-level themes, while CTCL-Generator is a powerful language model capable of generating documents based on specific input conditions like keywords.
The initial phase involves developing these components. CTCL-Topic is derived from Wikipedia, clustering documents into roughly 1,000 distinct topics, each represented by ten keywords. Concurrently, CTCL-Generator, a 140-million-parameter conditional language model, is constructed through continual pre-training on a massive dataset of description-document pairs, created by prompting Gemma-2-2B to describe documents from SlimPajama.
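The topic-model side of this setup can be illustrated with a toy sketch. The code below (names like `build_topic_model` are illustrative, not from the paper's code) clusters document embeddings with a small k-means loop and labels each cluster with its most frequent words, standing in for CTCL-Topic's roughly 1,000 Wikipedia-derived topics with ten keywords each:

```python
import numpy as np
from collections import Counter

def build_topic_model(docs, embeddings, n_topics=2, n_keywords=3, seed=0):
    """Toy stand-in for CTCL-Topic: cluster document embeddings with
    k-means, then label each cluster with its most frequent words."""
    rng = np.random.default_rng(seed)
    X = np.asarray(embeddings, dtype=float)
    # Initialize centers from randomly chosen documents.
    centers = X[rng.choice(len(X), n_topics, replace=False)].copy()
    for _ in range(20):  # Lloyd's iterations
        # Assign each document to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(n_topics):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    # Represent each topic by its most frequent words (ten in CTCL).
    keywords = {}
    for k in range(n_topics):
        counts = Counter(w for doc, lab in zip(docs, labels) if lab == k
                         for w in doc.split())
        keywords[k] = [w for w, _ in counts.most_common(n_keywords)]
    return labels, keywords
```

In CTCL itself the clustering is done once over Wikipedia and the generator is separately pre-trained on Gemma-2-2B-produced description-document pairs; this sketch only conveys the topics-plus-keywords structure.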
In the second stage, the framework learns the private domain. CTCL-Topic captures the high-level topic distribution of the private corpus as a privacy-preserving histogram recording the percentage of documents in each topic. Each private document is also mapped to its topic, whose ten keywords serve as that document’s conditioning input. The CTCL-Generator is then fine-tuned with differential privacy on the resulting keyword-document pairs.
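To make the histogram step concrete, here is a minimal sketch of one standard way to privatize topic counts. It assumes each document contributes to exactly one topic bin, so the count vector has L1 sensitivity 1 and Laplace noise with scale 1/ε makes this release ε-differentially private; the paper's exact mechanism and privacy accounting may differ:

```python
import numpy as np

def dp_topic_histogram(topic_labels, n_topics, epsilon, seed=0):
    """Release a differentially private topic histogram.

    Assumption: each document falls in exactly one topic bin, so the
    count vector has L1 sensitivity 1 and Laplace(1/epsilon) noise
    per bin gives epsilon-DP for this single release.
    """
    rng = np.random.default_rng(seed)
    counts = np.bincount(topic_labels, minlength=n_topics).astype(float)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=n_topics)
    noisy = np.clip(noisy, 0.0, None)  # a histogram cannot be negative
    total = noisy.sum()
    # Normalize to a probability distribution over topics.
    return noisy / total if total > 0 else np.full(n_topics, 1.0 / n_topics)
```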
The final stage is the generation of synthetic data. The differentially privately fine-tuned CTCL-Generator is sampled for each topic in proportion to the privacy-preserving topic histogram, giving precise control over the synthetic dataset’s composition. A key advantage is that the CTCL-Generator can produce an arbitrary amount of synthetic data without incurring any additional privacy cost, a benefit derived from the post-processing property of differential privacy.
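Because this stage only post-processes the already-privatized histogram, it spends no extra privacy budget no matter how many samples are drawn. A hedged sketch, where `generate` is a hypothetical stand-in for the DP fine-tuned CTCL-Generator conditioned on a topic's keywords:

```python
import numpy as np

def allocate_samples(topic_hist, total, seed=0):
    """Draw per-topic generation quotas proportional to the (already
    DP-released) topic histogram; pure post-processing, no extra
    privacy cost."""
    rng = np.random.default_rng(seed)
    return rng.multinomial(total, topic_hist)

def synthesize(topic_hist, topic_keywords, total, generate, seed=0):
    """Call the generator once per quota slot, conditioning on the
    topic's keywords. `generate` is a placeholder for the fine-tuned
    CTCL-Generator."""
    samples = []
    for topic, quota in enumerate(allocate_samples(topic_hist, total, seed)):
        samples.extend(generate(topic_keywords[topic]) for _ in range(quota))
    return samples
```

A trivial usage with a fake generator: `synthesize([0.7, 0.3], {0: ["medicine"], 1: ["markets"]}, 100, lambda kws: "doc about " + " ".join(kws))` returns 100 strings whose topic mix tracks the histogram.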
Experiments were conducted on four diverse datasets: three for generative tasks (PubMed, Chatbot Arena, Multi-Session Chat) and one for a classification task (OpenReview). Generative tasks, which evaluate next-token prediction accuracy, are more demanding as they require preserving fine-grained textual information. Quality was assessed by training a small downstream language model or classifier on the synthetic data and measuring its accuracy on real test data, with careful measures to prevent data contamination.
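The train-on-synthetic, test-on-real protocol can be illustrated with any small downstream model; in this sketch a nearest-centroid classifier is a hypothetical stand-in for the paper's small downstream language model or classifier:

```python
import numpy as np

def downstream_utility(synth_X, synth_y, real_X, real_y):
    """Fit a tiny model on synthetic examples only, then score it on
    held-out real data. Nearest-centroid is an illustrative stand-in
    for the downstream models used in the paper."""
    X = np.asarray(synth_X, dtype=float)
    y = np.asarray(synth_y)
    classes = sorted(set(y.tolist()))
    # One centroid per class, learned from synthetic data alone.
    centroids = {c: X[y == c].mean(axis=0) for c in classes}
    preds = [min(classes,
                 key=lambda c: np.linalg.norm(np.asarray(x, float) - centroids[c]))
             for x in real_X]
    # Accuracy on real test data is the utility score.
    return float(np.mean([p == t for p, t in zip(preds, real_y)]))
```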
The results consistently demonstrated CTCL’s superior performance across all datasets, especially under strong privacy guarantees (smaller epsilon values). It outperformed baselines like directly differentially private fine-tuning and Aug-PE, highlighting its robust ability to capture valuable private information while maintaining high privacy standards.
Furthermore, CTCL exhibited better scalability than Aug-PE in terms of both privacy budget and synthetic data volume. CTCL’s performance improved with an increased privacy budget, a trend not observed with Aug-PE. Similarly, downstream model accuracy continued to rise with more CTCL-generated samples, whereas Aug-PE’s performance plateaued. These findings underscore that fine-tuning-based methods, like CTCL, are more effective at capturing fine-grained statistics than prompting-based methods such as Aug-PE.
Ablation studies further validated the critical impact of two design elements: the pre-training of the CTCL-Generator on public corpora and the integration of keyword-based conditions during differentially private fine-tuning. These studies revealed that incorporating keywords during fine-tuning reduced test loss by 50%, with an additional 50% reduction gained from adding pre-training, for a fixed privacy budget. This confirms both components are fundamental to the framework’s efficacy.
Looking ahead, while CTCL currently employs a 140-million-parameter generator, the underlying principle of using clustering information or LLM-extracted metadata as input instructions can be readily extended to larger models. This avenue is actively being explored to further enhance real-world applications of privacy-preserving data synthesis.