Distribution Matching in Synthetic Data: What It Means and Why It Matters

When we talk about generating "realistic" synthetic data, what we actually mean is generating data whose statistical properties match those of real data. "Realistic-looking" rows aren't useful if your model trains on them and then fails on real data. The measure that matters is distributional fidelity — and understanding what that requires is more nuanced than it first appears.

Marginal distributions: the baseline requirement

The simplest form of distributional fidelity is matching marginal distributions: the distribution of each column considered on its own. If your real dataset has an age column with a roughly normal distribution centered at 42 with a standard deviation of 14, your synthetic age column should have the same shape. If your transaction_type categorical column has 60% "purchase", 25% "refund", and 15% "dispute", your synthetic column should approximate those proportions.

Marginal matching is necessary but not sufficient. A generator that samples each column independently from fitted marginals will produce rows where age and income are statistically independent, which real data contradicts: income tends to rise with age up to a point and then plateau. A model trained on data where this correlation is absent will learn a different decision boundary than one trained on realistic data.
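To make the failure concrete, here is a minimal sketch with made-up numbers. The toy age and income columns, their coefficients, and the noise level are all assumptions for illustration, not measurements from any real dataset. Each synthetic column is resampled independently from the corresponding real column, so the marginals are preserved while the relationship between the columns is lost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: income rises with age (hypothetical coefficients).
n = 5_000
age = rng.normal(42, 14, n).clip(18, 90)
income = 20_000 + 900 * age + rng.normal(0, 8_000, n)

# Marginals-only generator: resample each column on its own.
synth_age = rng.choice(age, size=n, replace=True)
synth_income = rng.choice(income, size=n, replace=True)

real_corr = np.corrcoef(age, income)[0, 1]
synth_corr = np.corrcoef(synth_age, synth_income)[0, 1]
print(f"real correlation:      {real_corr:.2f}")
print(f"synthetic correlation: {synth_corr:.2f}")
```

The real columns are strongly correlated, while the independently resampled columns show a correlation near zero, even though a per-column histogram comparison would look nearly perfect.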

Pairwise correlations and the joint distribution

The full joint distribution P(X₁, X₂, ..., Xₙ) is what we really want to match. For continuous features, this includes all pairwise correlations and higher-order interactions. For categorical features, it includes the joint probability mass over all combinations of categories.

Fitting the true joint distribution from a sample is statistically intractable for high-dimensional data: without structural assumptions, the number of parameters grows exponentially with the number of columns. Practical synthetic generation handles this in one of two ways: by modeling a factored approximation (marginals plus pairwise correlations, ignoring higher-order terms), or by learning a latent representation that implicitly captures multi-column dependencies (the GAN/VAE approach).
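A quick count shows the scale of the problem. The sketch below assumes d categorical columns with k categories each, and compares the free parameters of an unrestricted joint table with those of a log-linear model limited to main effects and two-way interactions. The function names are illustrative.

```python
from math import comb

def full_joint_params(d: int, k: int) -> int:
    # One probability per cell of the k**d table, minus one because they sum to 1.
    return k**d - 1

def pairwise_params(d: int, k: int) -> int:
    # Main effects: d * (k - 1). Two-way interactions: (k - 1)**2 per column pair.
    return d * (k - 1) + comb(d, 2) * (k - 1) ** 2

print(full_joint_params(10, 5))  # 9765624
print(pairwise_params(10, 5))    # 760
```

With ten columns of five categories each, the unrestricted table needs millions of parameters, far more than a 10,000-row sample can estimate, while the pairwise model needs a few hundred.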

The factored approach has an advantage for tabular data: it's interpretable and auditable. You can directly measure whether the pairwise correlation between age and income in synthetic data matches the real data. The latent representation approach may capture more complex dependencies but makes it harder to diagnose when something goes wrong.
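One well-known way to realize a marginals-plus-pairwise model for continuous columns is a Gaussian copula. The sketch below is an illustration under that assumption, not a description of any particular product's implementation. The function names and the toy data are hypothetical. It converts each column to normal scores through its ranks, samples correlated normals, maps them back through each column's empirical quantiles, and then audits the result by comparing correlation matrices.

```python
from statistics import NormalDist
import numpy as np

_STD_NORMAL = NormalDist()

def to_normal_scores(col):
    """Map a column to standard-normal scores through its empirical ranks."""
    n = len(col)
    ranks = np.argsort(np.argsort(col))      # rank of each value, 0..n-1
    u = (ranks + 0.5) / n                    # strictly inside (0, 1)
    return np.array([_STD_NORMAL.inv_cdf(p) for p in u])

def fit_and_sample(real, n_samples, rng):
    """Fit a Gaussian copula to `real` (rows x columns) and draw synthetic rows."""
    n_cols = real.shape[1]
    z = np.column_stack([to_normal_scores(real[:, j]) for j in range(n_cols)])
    corr = np.corrcoef(z, rowvar=False)      # pairwise structure in normal-score space
    z_new = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u_new = np.vectorize(_STD_NORMAL.cdf)(z_new)
    # Push each uniform back through that column's empirical quantile function.
    return np.column_stack(
        [np.quantile(real[:, j], u_new[:, j]) for j in range(n_cols)]
    )

rng = np.random.default_rng(0)
age = rng.normal(42, 14, 2_000)                             # toy data, hypothetical
income = 20_000 + 900 * age + rng.normal(0, 8_000, 2_000)
real = np.column_stack([age, income])

synth = fit_and_sample(real, 2_000, rng)

# Audit: largest absolute gap between real and synthetic correlation matrices.
corr_gap = np.abs(
    np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
).max()
print(f"max correlation gap: {corr_gap:.3f}")
```

The audit step is the point: because the model is factored, a single number per column pair tells you whether the dependency was reproduced.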

In Twynvex, we model marginals and pairwise correlations as the primary fidelity target, with post-generation constraint solving to enforce schema-level rules. For most tabular ML training use cases, pairwise correlation fidelity captures the distributional structure that matters most for model training — the higher-order interactions tend to matter more for generation of sequences or structured text than for independent-row tabular data.

Jensen-Shannon divergence as a fidelity metric

Once you have synthetic data, you need a way to measure how well it matches the real distribution. A commonly used metric for continuous distributions is Jensen-Shannon (JS) divergence, a symmetric, bounded variant of KL divergence that measures the difference between two probability distributions.

JS divergence ranges from 0 (identical distributions) to 1 (maximally different distributions, when using log base 2). For each continuous column, you can compute JS divergence between the empirical distribution of real values and the empirical distribution of synthetic values, using binned histograms or kernel density estimates.

A JS divergence near 0 for a column means the synthetic generator is capturing that column's distribution accurately. As a rule of thumb, values above 0.1 indicate meaningful distributional shift, and values above 0.3 suggest the generator has failed to capture that column's distribution, so synthetic values for that feature will mislead a model trained on them. Histogram-based estimates depend on bin count and sample size, so keep the binning fixed when comparing columns or runs.
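A minimal histogram-based sketch of the per-column computation follows. The bin count of 30 and the toy columns are assumptions, and the function does not handle the degenerate case where every value is identical. Both samples are binned on shared edges so the two histograms are directly comparable, and log base 2 keeps the result in the 0 to 1 range.

```python
import numpy as np

def js_divergence(real, synth, bins=30):
    """Jensen-Shannon divergence (log base 2) between two 1-D samples."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    edges = np.linspace(lo, hi, bins + 1)          # shared bin edges
    p = np.histogram(real, bins=edges)[0] / len(real)
    q = np.histogram(synth, bins=edges)[0] / len(synth)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                               # 0 * log(0) is treated as 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

rng = np.random.default_rng(0)
real_age = rng.normal(42, 14, 5_000)
good_synth = rng.normal(42, 14, 5_000)     # same distribution
bad_synth = rng.normal(500, 14, 5_000)     # no overlap with the real column

js_good = js_divergence(real_age, good_synth)
js_bad = js_divergence(real_age, bad_synth)
print(f"matching column:    {js_good:.3f}")
print(f"mismatched column:  {js_bad:.3f}")
```

If you use SciPy instead, note that `scipy.spatial.distance.jensenshannon` returns the JS distance, which is the square root of the divergence, and defaults to the natural logarithm. Square the result and pass `base=2` before comparing against the thresholds above.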

For categorical columns, you compare the frequency distributions directly. Chi-squared distance and total variation distance are both simpler to compute and interpret than JS divergence for discrete distributions.
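Total variation distance is short enough to sketch in full. It is half the sum of absolute differences between category frequencies, so it reads directly as the share of probability mass assigned to the wrong category. The function name and the example proportions are illustrative.

```python
from collections import Counter

def total_variation_distance(real, synth):
    """Half the L1 distance between two empirical category distributions."""
    real_counts, synth_counts = Counter(real), Counter(synth)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synth))
        for c in categories
    )

real_types = ["purchase"] * 60 + ["refund"] * 25 + ["dispute"] * 15
synth_types = ["purchase"] * 50 + ["refund"] * 30 + ["dispute"] * 20

tvd = total_variation_distance(real_types, synth_types)
print(round(tvd, 3))  # 0.1
```

Categories that appear in only one of the two columns are handled naturally, because a missing category counts as frequency zero.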

In Twynvex's quality reports, we surface per-column JS divergence and flag columns where fidelity is below threshold. This gives you a diagnostic that tells you specifically which features are being synthesized accurately and which aren't — rather than an aggregate score that can obscure individual column failures.

The TSTR framework: utility as the real validation

Distribution matching metrics tell you whether synthetic data looks like real data statistically. But the test that matters for ML training is whether a model trained on synthetic data performs comparably to a model trained on real data. The Train on Synthetic, Test on Real (TSTR) framework formalizes this.

The process: train a downstream model (binary classifier, regressor, etc.) on synthetic data only. Evaluate it on a held-out real test set. Compare the AUC (or relevant metric) to a baseline model trained on real data. The ratio of synthetic-trained to real-trained performance tells you the ML utility of the synthetic data.

A TSTR AUC ratio close to 1.0 means the synthetic data captures enough of the real distribution's structure to produce a comparably performing model. A ratio significantly below 1.0 (say, 0.75 or lower) means the synthetic data is missing some distributional structure that the model relies on for real predictions.
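The procedure can be sketched end to end on toy data. Everything here is an assumption made for illustration: the data-generating process, the hand-rolled logistic regression and rank-based AUC (used only to keep the example dependency-free; in practice you would use your own model and metric library), and the two stand-in "generators". One stand-in is faithful. The other preserves each column's marginal but breaks one feature's relationship to the label.

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC (Mann-Whitney U); assumes continuous scores without ties."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_logreg(X, y, lr=0.5, steps=500):
    """Gradient-descent logistic regression; returns a function that scores new rows."""
    mu, sd = X.mean(axis=0), X.std(axis=0)
    Xs = (X - mu) / sd
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))
        w -= lr * (Xs.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return lambda X_new: ((X_new - mu) / sd) @ w + b

def tstr_ratio(synth, real_train, real_test):
    """Each argument is an (X, y) pair. Returns synthetic-trained AUC / real-trained AUC."""
    X_test, y_test = real_test
    auc_synth = auc(y_test, fit_logreg(*synth)(X_test))
    auc_real = auc(y_test, fit_logreg(*real_train)(X_test))
    return auc_synth / auc_real

def draw(n, rng):
    """Toy process: the label depends on both features plus noise."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(float)
    return X, y

rng = np.random.default_rng(0)
real_train, real_test = draw(2_000, rng), draw(2_000, rng)

# Stand-in for a faithful generator: fresh rows from the same process.
good_synth = draw(2_000, rng)

# Stand-in for a marginals-only generator: column 0 keeps its distribution
# but loses its relationship to the label and to column 1.
X_bad, y_bad = draw(2_000, rng)
X_bad[:, 0] = rng.permutation(X_bad[:, 0])
bad_synth = (X_bad, y_bad)

ratio_good = tstr_ratio(good_synth, real_train, real_test)
ratio_bad = tstr_ratio(bad_synth, real_train, real_test)
print(f"faithful generator ratio:       {ratio_good:.2f}")
print(f"marginals-only generator ratio: {ratio_bad:.2f}")
```

The second generator would pass a per-column marginal check on column 0, yet its TSTR ratio drops because the model never sees that feature carry signal.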

TSTR is more expensive than distribution distance metrics because it requires training and evaluating an additional model, but it's the most direct test of whether synthetic data is useful. We recommend running TSTR whenever you build a new generation configuration, and using distribution metrics for faster ongoing quality checks.

Why naive random generation fails

It's worth being explicit about what doesn't work. If you generate synthetic rows by sampling each column independently from a Gaussian or uniform distribution, without fitting to real data at all, you get rows that are structurally wrong in several ways. Categorical cardinality will be wrong (columns that have 5 valid categories will have invented values). Numeric ranges will be wrong (negative ages, transaction amounts in the billions). Correlations will be absent. Constraints between columns will be violated.
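A small check makes these failures measurable. The schema below is hypothetical (column names, ranges, and allowed categories are invented for the example), and the naive generator is a deliberately careless one of the kind described above.

```python
import random

# Hypothetical schema for illustration; real schemas will differ.
SCHEMA = {
    "age": {"min": 18, "max": 100},
    "amount": {"min": 0.0, "max": 50_000.0},
    "transaction_type": {"allowed": {"purchase", "refund", "dispute"}},
}

def violation_rates(rows, schema):
    """Fraction of rows that break each column's range or category rule."""
    rates = {}
    for col, rule in schema.items():
        bad = 0
        for row in rows:
            value = row[col]
            if "allowed" in rule:
                bad += value not in rule["allowed"]
            else:
                bad += not (rule["min"] <= value <= rule["max"])
        rates[col] = bad / len(rows)
    return rates

random.seed(0)
naive_rows = [
    {
        "age": random.gauss(0, 50),                 # unfitted Gaussian
        "amount": random.uniform(-1e9, 1e9),        # unfitted uniform
        "transaction_type": random.choice(
            ["purchase", "refund", "dispute", "type_4", "type_5"]  # invented values
        ),
    }
    for _ in range(1_000)
]

rates = violation_rates(naive_rows, SCHEMA)
for col, rate in rates.items():
    print(f"{col}: {rate:.0%} of rows invalid")
```

A check like this catches range and category errors only. Missing correlations and cross-column constraint violations need the distribution and utility metrics discussed earlier.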

This kind of naive generation is useless for model training and actively harmful if the trained model is ever expected to generalize to real data. Yet we've seen it in internal data science notebooks where someone needed to quickly generate test data and didn't have access to real data. The output looks fine on human inspection: it's tabular, it has the right columns, and the values are numbers and strings. But the statistical structure is completely missing.

The same problem appears in a subtler form with SMOTE (Synthetic Minority Over-sampling Technique): SMOTE interpolates between existing minority class examples in feature space, which preserves some local structure but doesn't correctly model the minority class distribution. At very low positive rates (0.01% fraud, for instance), SMOTE doesn't have enough real examples to interpolate from and the resulting synthetic samples cluster near the few real positives rather than covering the true minority distribution.
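The clustering effect follows directly from how interpolation works. Below is a simplified SMOTE-style sketch, not the imbalanced-learn implementation. The function name, the choice of k, and the toy minority distribution are assumptions. Because every synthetic point lies on a segment between two observed positives, none can fall outside the region the observed positives already span.

```python
import numpy as np

def smote_like(minority, n_new, k, rng):
    """Interpolate between a minority point and one of its k nearest minority neighbors."""
    # Pairwise Euclidean distances between minority points.
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]    # k nearest neighbors per point
    base = rng.integers(0, len(minority), size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                   # interpolation weight in [0, 1)
    return minority[base] + gap * (minority[chosen] - minority[base])

rng = np.random.default_rng(0)
true_minority = rng.normal(size=(10_000, 2))   # stand-in for the real minority distribution
observed = true_minority[:5]                   # only five positives were ever observed

synth = smote_like(observed, n_new=1_000, k=3, rng=rng)

lo, hi = observed.min(axis=0), observed.max(axis=0)
synth_inside = bool(np.all((synth >= lo - 1e-9) & (synth <= hi + 1e-9)))
true_outside = float(np.mean(np.any((true_minority < lo) | (true_minority > hi), axis=1)))
print(f"all synthetic points inside the observed bounding box: {synth_inside}")
print(f"share of the true minority outside that box: {true_outside:.2f}")
```

However many synthetic rows are generated, they add density inside the hull of the five observed points and nothing outside it, which is where much of the true minority distribution lives.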

What good distribution matching actually requires of your input sample

A subtle point worth raising: distribution matching from a sample requires that your sample is itself representative of the real distribution. If your 10,000-row sample has a selection bias — say, it was drawn from transactions above a certain amount threshold, or from a specific geographic region — the generator will learn that biased distribution and produce synthetic data that inherits the bias.

We're not saying distribution matching can fix a biased sample. It can't, and it doesn't claim to. What it can do is faithfully reproduce the statistical structure of the sample you provide. If you feed it a representative sample, you get a generator that produces representative synthetic rows. If you feed it a biased sample, you get synthetic rows with the same bias. The quality of the input governs the quality of the output — distribution matching is a faithful modeling process, not a data cleaning step.
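A toy demonstration of bias inheritance, under stated assumptions: the population is a made-up lognormal distribution of transaction amounts, the source sample is truncated at a hypothetical threshold of 50, and the "generator" is simply resampling from the source's empirical distribution, which stands in for any faithful model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of transaction amounts.
population = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)

# Biased source sample: only transactions above a threshold were collected.
threshold = 50.0
biased_sample = population[population > threshold][:10_000]

# Stand-in for a faithful generator: resample the source's empirical distribution.
synth = rng.choice(biased_sample, size=10_000, replace=True)

print(f"population median:    {np.median(population):.1f}")
print(f"biased sample median: {np.median(biased_sample):.1f}")
print(f"synthetic median:     {np.median(synth):.1f}")
```

The synthetic column tracks the biased sample closely and contains no values below the threshold, so it would score well on every fidelity metric computed against the source while misrepresenting the population.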

This matters particularly for tail-targeted generation: when you configure Twynvex to oversample the tail of a distribution, you're intentionally shifting away from the source sample's distribution. The question is whether the tail samples in your source accurately represent the real-world distribution of tail events, or whether they're themselves sparse and potentially biased. Auditing your source data before configuring generation is time worth spending.