AI's Synthetic Data Boom: Innovation vs. Copyright Concerns
The rapid pace of artificial intelligence development faces a looming challenge: a potential shortage of high-quality training data. As websites increasingly erect barriers to data collection and existing online content is voraciously scraped to fuel AI model training, concerns are mounting that the well of usable information may soon run dry. The industry’s proposed solution is increasingly clear: synthetic data.
“Recently in the industry, synthetic data has been talked about a lot,” stated Sebastien Bubeck, a member of technical staff at OpenAI, during the company’s recent GPT-5 release event. Bubeck underscored its pivotal role for the future of AI models, a sentiment echoed by OpenAI CEO Sam Altman, who conveyed his excitement for “much more to come.”
However, the prospect of heavy reliance on AI-generated data has not gone unnoticed by the creative industries. Reid Southern, a film concept artist and illustrator, suggests that AI companies like OpenAI are turning to synthetic data primarily because they have exhausted the supply of high-quality, human-created content available on the public internet. Southern also posits a more controversial motive: “It further distances them from any copyrighted materials they’ve trained on that could land them in hot water.”
For this reason, Southern has publicly dubbed the practice “data laundering.” He argues that AI companies could initially train their models on copyrighted works, subsequently generate AI variations of that content, and then remove the original copyrighted material from their training datasets. This strategy, he claims, would allow them to assert that their training set is “ethical” because, by their logic, it did not “technically” train on the original copyrighted image. “That’s why we call it data laundering,” Southern explains, “because in a sense, they’re attempting to clean the data and strip it of its copyright.”
In response, an OpenAI spokesperson affirmed the company’s commitment to responsible development: “We create synthetic data to advance AI, in line with relevant copyright laws.” The spokesperson added that generating high-quality synthetic data enables them to build more intelligent and capable products like ChatGPT, which empower millions to work more efficiently, discover new ways to learn and create, and foster global innovation and competition.
Felix Simon, an AI researcher at the University of Oxford, views the issue through a more nuanced lens. He points out that while synthetic data might appear to offer a clean slate, it “doesn’t really remediate the original harm over which creators and AI firms squabble.” He emphasizes that synthetic data is not conjured from thin air; it is presumably created by models that have themselves been trained on data from creators and copyright holders, often without permission or compensation. From a perspective of societal justice, rights, and duties, Simon asserts that “these rights holders still are owed something even with the use of synthetic data—be that compensation, acknowledgements, or both.”
Ed Newton-Rex, founder of Fairly Trained—a non-profit that certifies AI companies that respect creators’ intellectual property rights—shares Southern’s fundamental concerns. He acknowledges synthetic data’s legitimate utility as a means to “augment your dataset” and “increase the coverage of your training data.” At a time when the industry is “butting up against the limits of legitimately accessible training data,” synthetic data is perceived as a way to “extend the usable life of that data.”
However, Newton-Rex also cautions against its darker implications. “At the same time, I think unfortunately its effect is, at least in part, one of copyright laundering,” he states, concluding that “both are true.” He warns against blindly accepting AI firms’ assurances, stressing that synthetic data is “not a panacea from the incredibly important copyright questions.” The notion that synthetic data allows AI developers to circumvent copyright concerns is, in his view, fundamentally mistaken.
Newton-Rex further argues that the very framing of synthetic data, and how AI companies discuss model training, serves to distance them from the individuals whose work they may be utilizing. “The average listener, if they hear this model was trained on synthetic data, they’re bound to think, ‘Oh, right, okay. Well, this probably isn’t Ed Sheeran’s latest album, right?’” he posits. This narrative, he contends, “further moves us away from an easy understanding of how these models are actually made, which is ultimately by exploiting people’s life’s work.” He draws an analogy to plastic recycling, where a recycled container might have originated as a toy or a car bumper. The act of AI models mashing up diverse inputs to generate “new output” does nothing, he maintains, to diminish their reliance on the original human work.
For Newton-Rex, the critical takeaway remains: “Really the absolutely critical element here, and it’s just got to be remembered, is that even in a world of synthetic data, what’s happening is people’s work is being exploited in order to compete with them.”