Winning a Synthetic Data Challenge with a Post-Processing Approach
A recent triumph in the “Mostly AI Prize” competition has illuminated a crucial insight into synthetic data generation: while advanced machine learning models are indispensable, achieving high-fidelity data often hinges on sophisticated post-processing. The winning solution, which secured top honors in both the FLAT and SEQUENTIAL data challenges, demonstrated how meticulous refinement can elevate raw model output to near-perfect statistical alignment with source data.
The “Mostly AI Prize” challenged participants to generate synthetic datasets that precisely mirrored the statistical characteristics of the original source data, crucially without directly copying any real records. The competition featured two distinct challenges: the FLAT Data Challenge, requiring 100,000 records across 80 columns, and the SEQUENTIAL Data Challenge, involving 20,000 sequences of records. Data quality was evaluated with an “Overall Accuracy” metric quantifying the L1 distance (a measure of distributional difference) between synthetic and source data across univariate, bivariate, and trivariate column combinations. To guard against overfitting or outright replication of real records, privacy metrics such as Distance to Closest Record (DCR) and Nearest Neighbor Distance Ratio (NNDR) were also applied.
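To make that scoring concrete, here is a minimal sketch of the univariate building block of such an accuracy metric, assuming categorical columns and an “accuracy = 1 - L1/2” normalization; the official metric also covers bivariate and trivariate combinations and bins numeric columns, which this toy version omits.

```python
import pandas as pd

def l1_distance(source_col: pd.Series, synth_col: pd.Series) -> float:
    """L1 distance between the category distributions of two columns."""
    p = source_col.value_counts(normalize=True)
    q = synth_col.value_counts(normalize=True)
    p, q = p.align(q, fill_value=0.0)   # cover categories missing on one side
    return float((p - q).abs().sum())

def univariate_accuracy(source: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Mean per-column accuracy, taking accuracy = 1 - L1/2."""
    scores = [1.0 - l1_distance(source[c], synth[c]) / 2.0 for c in source.columns]
    return sum(scores) / len(scores)
```

The same L1 comparison extends to pairs and triples of columns by treating each combination of values as a single cell.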
Initial efforts exploring an ensemble of state-of-the-art generative models yielded only marginal improvements. The pivotal shift came with an intensive focus on post-processing. The strategy involved training a single generative model from the Mostly AI SDK, then oversampling to create a significantly larger pool of candidate samples. From this extensive pool, the final output was meticulously selected and refined. This approach dramatically boosted performance: for the FLAT data challenge, raw synthetic data scored around 0.96, but after post-processing, the score soared to an impressive 0.992. A modified version delivered similar gains in the SEQUENTIAL challenge.
The final pipeline for the FLAT challenge comprised three principal steps: Iterative Proportional Fitting (IPF), Greedy Trimming, and Iterative Refinement.
IPF served as the crucial first step, selecting a high-quality, oversized subset from an initial pool of 2.5 million generated rows. This classical statistical algorithm adjusted the synthetic data’s bivariate (two-column) distributions to closely match those of the original data. Focusing on the 5,000 most correlated column pairs, IPF calculated fractional weights for each synthetic row, iteratively adjusting them until bivariate distributions aligned with the target. These weights were then converted into integer counts, yielding an oversized subset of 125,000 rows—1.25 times the required size—already boasting strong bivariate accuracy.
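The following is a minimal, unoptimized sketch of how such IPF reweighting over bivariate marginals can look, assuming categorical columns and a fixed number of iterations; the pair selection and the 125,000-row target follow the write-up, but every implementation detail here is an assumption rather than the winning code.

```python
import numpy as np
import pandas as pd

def ipf_weights(synth: pd.DataFrame, source: pd.DataFrame,
                column_pairs: list[tuple[str, str]],
                n_target: int, n_iter: int = 20) -> np.ndarray:
    """Fractional row weights that align the weighted bivariate
    distributions of `synth` with those of `source`."""
    w = np.full(len(synth), n_target / len(synth))
    for _ in range(n_iter):
        for a, b in column_pairs:
            # Target cell counts for this column pair, scaled to n_target rows.
            target = source.groupby([a, b]).size() * (n_target / len(source))
            # Which (a, b) cell each synthetic row falls into.
            cells = pd.MultiIndex.from_frame(synth[[a, b]])
            current = pd.Series(w, index=cells).groupby(level=[0, 1]).sum()
            # Multiplicative update: rescale each cell's rows toward the target;
            # cells absent from the source are driven to zero weight.
            ratio = (target / current).reindex(cells).to_numpy()
            w = w * np.nan_to_num(ratio, nan=0.0, posinf=0.0)
    return w
```

The fractional weights would then be converted into integer per-row counts, as described above, to materialize the oversized 125,000-row subset.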
The oversized subset then underwent a Greedy Trimming phase. This iterative process calculated the “error contribution” of each row, systematically removing those contributing most to the statistical distance from the target distribution. This continued until precisely 100,000 rows remained, discarding the least accurate samples.
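A toy version of such a greedy trimming loop, scored only on univariate (per-column) distributions and assuming categorical columns without missing values, might look like the sketch below; the real pipeline also tracked bivariate and trivariate combinations and updated error contributions incrementally rather than recomputing them on every pass.

```python
import numpy as np
import pandas as pd

def greedy_trim(synth: pd.DataFrame, source: pd.DataFrame, n_keep: int) -> pd.DataFrame:
    cols = list(source.columns)
    codes, targets = {}, {}
    for c in cols:
        cats = pd.Index(source[c].dropna().unique())
        codes[c] = pd.Categorical(synth[c], categories=cats).codes  # -1 = unseen category
        targets[c] = source[c].value_counts(normalize=True).reindex(cats, fill_value=0.0).to_numpy()

    counts = {c: np.bincount(codes[c][codes[c] >= 0], minlength=len(targets[c])).astype(np.float64)
              for c in cols}
    keep = np.ones(len(synth), dtype=bool)
    n = len(synth)
    while n > n_keep:
        # Error contribution: rows sitting in over-represented cells score high.
        contrib = np.zeros(len(synth))
        for c in cols:
            surplus = counts[c] / n - targets[c]
            cell = codes[c]
            seen = cell >= 0
            contrib[seen] += surplus[cell[seen]]
            contrib[~seen] += 1.0 / n          # heuristic: unseen categories always hurt
        contrib[~keep] = -np.inf               # never re-remove a dropped row
        worst = int(np.argmax(contrib))
        keep[worst] = False
        for c in cols:
            if codes[c][worst] >= 0:
                counts[c][codes[c][worst]] -= 1.0
        n -= 1
    return synth.loc[keep]
```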
The final stage, Iterative Refinement, involved a sophisticated swapping process. The algorithm iteratively identified the worst-performing rows within the 100,000-row subset and searched the remaining 2.4 million rows in the unused data pool for optimal replacement candidates. A swap was executed only if it led to an improved overall score, providing a crucial final polish.
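A heavily simplified sketch of such a swap loop is shown below; it scores only univariate distributions, proposes swaps at random rather than targeting the worst-performing rows, and recomputes the score from scratch each time, all of which the actual pipeline handled far more efficiently with incremental updates.

```python
import numpy as np
import pandas as pd

def l1_error(selected: pd.DataFrame, source: pd.DataFrame) -> float:
    """Total univariate L1 distance between the selected subset and the source."""
    err = 0.0
    for c in source.columns:
        p = source[c].value_counts(normalize=True)
        q = selected[c].value_counts(normalize=True)
        p, q = p.align(q, fill_value=0.0)
        err += float((p - q).abs().sum())
    return err

def refine_by_swapping(selected: pd.DataFrame, pool: pd.DataFrame,
                       source: pd.DataFrame, n_rounds: int = 1_000,
                       seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    selected = selected.reset_index(drop=True)
    best = l1_error(selected, source)
    for _ in range(n_rounds):
        i = int(rng.integers(len(selected)))   # row to evict (random here; the winning
        j = int(rng.integers(len(pool)))       # pipeline targeted the worst-scoring rows)
        candidate = selected.copy()
        candidate.iloc[i] = pool.iloc[j].to_numpy()
        score = l1_error(candidate, source)
        if score < best:                       # accept the swap only if it improves the score
            selected, best = candidate, score
    return selected
```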
The SEQUENTIAL challenge introduced unique complexities: samples were groups of rows, and a “coherence” metric evaluated how well the generated sequences of events resembled the source data. The post-processing pipeline was adapted accordingly. A Coherence-Based Pre-selection step was introduced first, iteratively swapping entire groups so that coherence metrics, such as the distribution of “unique categories per sequence,” aligned with the original data. This ensured a sound sequential structure. The 20,000 coherence-optimized groups then underwent a statistical Refinement (Swapping) process similar to the one used for flat data, in which entire groups were swapped to minimize L1 error across univariate, bivariate, and trivariate distributions. Notably, “Sequence Length” was included as a feature to ensure group lengths were considered. The IPF approach, effective for flat data, proved less beneficial here and was omitted to reallocate computational resources.
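As an illustration, the coherence statistics named above can be computed along these lines, with group_col and value_col as placeholder column names; the full coherence evaluation and the group-swapping loop built on top of it are not reproduced here.

```python
import pandas as pd

def unique_categories_per_sequence(df: pd.DataFrame, group_col: str, value_col: str) -> pd.Series:
    """Distribution of 'number of distinct values per sequence' for one column."""
    per_group = df.groupby(group_col)[value_col].nunique()
    return per_group.value_counts(normalize=True).sort_index()

def sequence_length_distribution(df: pd.DataFrame, group_col: str) -> pd.Series:
    """Distribution of sequence lengths (rows per group)."""
    return df.groupby(group_col).size().value_counts(normalize=True).sort_index()

def coherence_l1(source: pd.DataFrame, synth: pd.DataFrame,
                 group_col: str, value_col: str) -> float:
    """L1 distance between the two 'unique categories per sequence' distributions."""
    p = unique_categories_per_sequence(source, group_col, value_col)
    q = unique_categories_per_sequence(synth, group_col, value_col)
    p, q = p.align(q, fill_value=0.0)
    return float((p - q).abs().sum())
```

Groups whose replacement reduces this distance are swapped in, analogous to the row-level refinement for flat data.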
The computationally intensive post-processing strategy demanded significant optimization to meet the competition’s time limits. Key techniques included downcasting data types (e.g., from 64-bit to 32-bit or 16-bit) to reduce memory use, and storing statistical contributions in SciPy sparse matrices. For the core refinement loops, whose specialized calculations were slow in standard NumPy, the bottleneck functions were decorated with @numba.njit so that Numba compiled them into highly optimized machine code at speeds comparable to C; Numba was applied judiciously, only to specific numerical bottlenecks.
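As an illustration of that pattern, the sketch below compiles a data-dependent selection loop, loosely modeled on the trimming step and restricted to a single column, with @numba.njit; the function and array names are hypothetical, not the competitors’ actual code.

```python
import numba
import numpy as np

@numba.njit(cache=True)
def greedy_removals(cell_codes, target_counts, counts, n_remove):
    """Sequentially pick rows to drop, always from the currently most
    over-represented cell; the data-dependent updates make this loop hard
    to vectorize, which is where Numba pays off."""
    removed = np.zeros(n_remove, dtype=np.int64)
    alive = np.ones(cell_codes.shape[0], dtype=np.bool_)
    for k in range(n_remove):
        # Cell with the largest surplus relative to its target count.
        worst_cell = np.argmax(counts - target_counts)
        # Drop the first still-alive row that sits in that cell.
        for i in range(cell_codes.shape[0]):
            if alive[i] and cell_codes[i] == worst_cell:
                alive[i] = False
                counts[worst_cell] -= 1.0
                removed[k] = i
                break
    return removed

# Smaller dtypes (int32/float32 instead of 64-bit) keep the candidate pool in memory,
# as noted above; the sizes here are illustrative only.
codes = np.random.randint(0, 50, size=100_000).astype(np.int32)
counts = np.bincount(codes, minlength=50).astype(np.float32)
targets = np.full(50, 100_000 / 50, dtype=np.float32)
dropped = greedy_removals(codes, targets, counts.copy(), n_remove=2_000)  # first call triggers JIT compilation
```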
This victory underscores a vital lesson for data scientists: the “secret ingredient” often extends beyond the generative model itself. While a robust model forms the foundation, the pre- and post-processing stages are equally, if not more, critical. For these synthetic data challenges, a meticulously designed post-processing pipeline, specifically tailored to the evaluation metrics, proved to be the decisive factor, securing the win without requiring additional machine learning model development. The competition reinforced the profound impact of data engineering and statistical refinement in achieving high-fidelity synthetic data.