Synthetic Data: AI's New Gold Rush or 'Data Laundering'?

Fast Company

The rapid advancement of artificial intelligence is approaching a critical bottleneck: a diminishing supply of high-quality training data. As websites increasingly implement barriers to data scraping and existing public content is voraciously consumed by AI models, concerns are mounting that the wellspring of usable information could soon run dry. The industry’s proposed solution, however, has ignited a fierce debate: synthetic data.

This concept, where AI models generate their own training data, is gaining significant traction within the tech community. Sebastien Bubeck, a member of technical staff at OpenAI, highlighted its importance during the recent GPT-5 release, a sentiment echoed by OpenAI CEO Sam Altman. The promise is clear: synthetic data could fuel the next generation of AI capabilities, enabling more intelligent and capable products like ChatGPT, which proponents argue will enhance productivity, foster learning, and drive global innovation. OpenAI maintains that its synthetic data generation adheres to relevant copyright laws.

Yet, this burgeoning reliance on machine-generated data has not gone unnoticed by the creative industries, sparking considerable apprehension. Reid Southern, a film concept artist and illustrator, suggests that AI companies are turning to synthetic data precisely because they have exhausted the supply of high-quality, human-created content available on the public internet. More pointedly, Southern believes there’s an ulterior motive: to distance themselves from any copyrighted materials their models might have initially trained on, thereby avoiding potential legal pitfalls.

Southern has publicly labeled this practice “data laundering.” He argues that AI firms could first train their models on copyrighted works, then generate new, AI-varied content based on that learning, and subsequently remove the original copyrighted material from their datasets. By this logic, they could then claim their training sets are “ethical” because they didn’t “technically” train on the original copyrighted image. Southern asserts that this process attempts to “clean the data and strip it of its copyright.”

Felix Simon, an AI researcher at the University of Oxford, offers a more nuanced perspective, acknowledging that while synthetic data might appear to offer a solution, it doesn't fundamentally "remediate the original harm" caused to creators. He points out that synthetic data isn't conjured from nothing; it is presumably created by models that were themselves trained on existing data from creators and copyright holders, often without their explicit permission or compensation. From a perspective of societal justice, rights, and duties, Simon contends that these rights holders are still owed something, be it compensation, acknowledgement, or both, even when synthetic data is employed.

Ed Newton-Rex, founder of Fairly Trained—a non-profit that certifies AI companies respecting intellectual property rights—shares Southern’s concerns. He concedes that synthetic data can be a genuinely helpful tool for augmenting datasets and increasing the coverage of training data, especially as AI development approaches the limits of legitimately accessible information. However, he also identifies a “darker side,” agreeing that its effect is, at least in part, a form of copyright laundering.

Newton-Rex cautions against accepting AI firms' assurances at face value, emphasizing that synthetic data is "not a panacea" for the critical copyright questions facing the industry. He warns against the pervasive yet mistaken belief among some AI developers that synthetic data can help them circumvent copyright concerns. Furthermore, he argues that the very framing of synthetic data, and the way AI companies discuss model training, serves to obscure the origins of their models and distance them from the individual creators whose work they may be using. He likens it to plastic recycling, where a recycled container's origins are obscured in its new form; similarly, AI models "mash all this stuff up and generate 'new output'" without reducing their reliance on the original work. For Newton-Rex, the crucial takeaway remains that even in a world reliant on synthetic data, "people's work is being exploited in order to compete with them."