Synthetic Data Generation for ML: When It Helps, When It Hurts
LLM-generated training data is everywhere. Here is when it accelerates model development and when it quietly poisons it.
Synthetic data went from research curiosity to default practice in two years. Every team has tried generating training data with an LLM. Half are happy; half are debugging mysterious regressions. The line between the two camps is mostly about discipline.
This post covers when synthetic data helps, when it hurts, and the rules we apply.
Where it helps
Augmenting underrepresented classes. Real data is often class-imbalanced, and sometimes you can’t collect more of the rare class. Carefully generated and validated synthetic examples for the minority class often improve recall without harming precision.
Format diversity. Same underlying intent, expressed in 20 different ways. LLMs are good at producing paraphrases. The downstream model becomes robust to surface variation.
Edge cases that are hard to collect. Adversarial inputs. Hostile parsing inputs. Out-of-distribution probes. Synthetic generation lets you specify the edge cases you want to stress-test.
Cold start. A new feature with no labeled data can be bootstrapped with synthetic data, then refined with real data as it arrives. Don’t ship to production purely on synthetic data; do use it to iterate.
Where it hurts
Distribution drift. The model trains on a distribution the generator produced — which is not your production distribution. The model performs great on eval (because the eval set was also generated) and badly in production.
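One cheap way to catch this before production does is to compare a simple feature of the synthetic inputs against the same feature in real traffic. The sketch below is an illustration under assumptions, not a prescribed method: it computes a two-sample Kolmogorov-Smirnov statistic in pure Python, and the feature (whitespace token count) and function names are ours, chosen for the example.

```python
from bisect import bisect_right


def token_length(text):
    """Illustrative feature (an assumption): whitespace-separated token count."""
    return len(text.split())


def ks_statistic(real_values, synthetic_values):
    """Largest gap between the two empirical CDFs.

    0.0 means the samples have identical distributions on this feature;
    1.0 means they do not overlap at all.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    gap = 0.0
    # Both CDFs are step functions, so the largest gap occurs at a sample point.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap
```

In practice you would run this on something like `[token_length(t) for t in real_texts]` versus the synthetic equivalent. What counts as too large a gap is a judgment call for your data, and a single feature only catches the crudest mismatches.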
Hallucinated labels. The LLM-generated “label” can be wrong, often subtly, and the downstream model then learns the LLM’s biases as if they were ground truth.
Loss of variation. LLM-generated text has telltale stylistic patterns. The downstream model learns to recognize those patterns instead of the underlying signal. When real-world inputs diverge stylistically, performance collapses.
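A rough proxy for this kind of stylistic sameness is the share of word n-grams in a corpus that are unique. The sketch below assumes simple lowercase whitespace tokenization and a hypothetical helper name, `distinct_n`; it is a quick screening signal, not a full style analysis.

```python
def distinct_n(texts, n=2):
    """Share of word n-grams that are unique across the corpus (0.0 to 1.0).

    Low values suggest the corpus keeps reusing the same phrasing.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    seen = set()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        for start in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[start:start + n]))
            total += 1
    return len(seen) / total if total else 0.0
```

Comparing the score for a synthetic batch against a similarly sized sample of real inputs is more informative than the absolute number, since the score depends on corpus size.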
Compounding errors. Synthetic data trains a model, the model generates more synthetic data, and so on. The signal degenerates. The literature calls this “model collapse” — we’ve seen it happen in production within months.
The rules we apply
Always validate against real data. Eval set is always real, never synthetic. If your eval is synthetic too, you’ve calibrated against the generator’s quirks.
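This rule is easy to enforce mechanically if every row carries a provenance tag. The sketch below assumes rows are dicts with a `source` field set to `"real"` or `"synthetic"`; that schema and the function names are hypothetical, chosen for illustration.

```python
def non_real_eval_rows(eval_rows):
    """Rows that are not explicitly tagged as real (untagged rows count too)."""
    return [row for row in eval_rows if row.get("source") != "real"]


def assert_eval_is_real(eval_rows):
    """Raise before an eval run if anything other than real data slipped in."""
    offenders = non_real_eval_rows(eval_rows)
    if offenders:
        raise ValueError(
            f"{len(offenders)} of {len(eval_rows)} eval rows are not tagged as real"
        )
```

Treating untagged rows as failures is a deliberate choice in this sketch: if provenance is unknown, the row cannot be trusted as real.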
Mix in real data. No production model trains on purely synthetic data. Even 10% real often saves the model from generator artifacts.
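A mix-in floor can be applied when the training set is assembled. The sketch below caps the synthetic portion so real data never falls under a minimum percentage; the function name, the 10% default (taken from the figure above), and the integer-percent parameter are assumptions for the example.

```python
import random


def build_training_mix(real, synthetic, min_real_percent=10, seed=0):
    """Combine real and synthetic examples, keeping real >= min_real_percent."""
    if not real:
        raise ValueError("need at least one real example")
    if not 0 < min_real_percent <= 100:
        raise ValueError("min_real_percent must be in (0, 100]")
    # Largest synthetic count that keeps real / (real + synthetic) at or
    # above the floor. Integer arithmetic avoids float rounding surprises.
    max_synthetic = len(real) * (100 - min_real_percent) // min_real_percent
    rng = random.Random(seed)
    kept = rng.sample(list(synthetic), min(len(synthetic), max_synthetic))
    mix = list(real) + kept
    rng.shuffle(mix)
    return mix
```

With 10 real examples and a 10% floor, at most 90 synthetic examples are kept, however many were generated. The surplus is dropped at random rather than by generation order, so no single generation batch is favored.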
Generate with diverse seeds. Different generation strategies (different LLMs, different prompts, different temperatures) reduce the “LLM fingerprint” in the training distribution.
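One way to keep any single generator setting from dominating is to plan generation as a round-robin over every combination of model, prompt, and temperature. The sketch below only builds the plan; the model and prompt names are placeholders, and calling an actual LLM is left out on purpose.

```python
import itertools
import random


def generation_plan(generators, prompts, temperatures, n_examples, seed=0):
    """Assign each example to a (generator, prompt, temperature) combination.

    Combinations are shuffled once, then cycled, so usage counts differ
    by at most one across combinations.
    """
    combos = list(itertools.product(generators, prompts, temperatures))
    if not combos:
        raise ValueError("need at least one generator, prompt, and temperature")
    random.Random(seed).shuffle(combos)
    return [combos[i % len(combos)] for i in range(n_examples)]
```

For example, `generation_plan(["llm_a", "llm_b"], ["p1", "p2", "p3"], [0.3, 0.9], 24)` covers 12 combinations twice each.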
Audit a sample. Have a domain expert review 50–100 generated examples. The rate of obviously wrong labels gives a rough lower bound on label noise, since subtle errors slip past a quick review.
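With only 50 to 100 reviewed examples, the observed error rate carries real uncertainty, so it helps to report an interval rather than a point estimate. The sketch below draws a random audit sample and computes a Wilson score interval; the function names are ours, and using the Wilson interval is our choice for the example.

```python
import math
import random


def draw_audit_sample(examples, n=100, seed=0):
    """Uniform random sample for expert review (all examples if fewer than n)."""
    return random.Random(seed).sample(list(examples), min(n, len(examples)))


def wilson_interval(wrong, reviewed, z=1.96):
    """Approximate 95% interval (for z=1.96) on the true label error rate."""
    if reviewed <= 0 or not 0 <= wrong <= reviewed:
        raise ValueError("need reviewed > 0 and 0 <= wrong <= reviewed")
    p = wrong / reviewed
    denom = 1 + z * z / reviewed
    centre = (p + z * z / (2 * reviewed)) / denom
    half = z * math.sqrt(p * (1 - p) / reviewed + z * z / (4 * reviewed**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

For 5 wrong labels out of 100 reviewed, the interval works out to roughly 2% to 11%. That width is the reason to treat a small audit as a screening step rather than a precise measurement.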
Avoid recursive generation. Don’t train a model on synthetic data and then use that model to generate more synthetic data. The line we recommend: the generator should sit outside your downstream model’s lineage, so it is neither a model the downstream model was derived from nor one derived from it.
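If model lineage is recorded, this rule can be checked before a generation run. The sketch below assumes a hypothetical `parents` mapping from each model name to the models it was derived from (fine-tuned from, or trained on outputs of in an earlier round). It reads the rule as independence in both directions, which is our interpretation.

```python
def ancestors(model, parents):
    """All models that `model` is directly or transitively derived from."""
    seen = set()
    stack = [model]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_independent_generator(generator, downstream, parents):
    """True when neither model appears in the other's lineage."""
    if generator == downstream:
        return False
    return (
        generator not in ancestors(downstream, parents)
        and downstream not in ancestors(generator, parents)
    )
```

The check is only as good as the lineage records. A generator trained on scraped outputs of your model would pass unless that relationship was written down.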
When synthetic data accelerates honest progress
The honest use case: synthetic data shortens the time to a working prototype and the first deployment. Real data still drives the long-term quality curve.
A working pattern:
- Phase 0: 100% synthetic, ship prototype, get production exposure
- Phase 1: 50/50 real/synthetic, collecting real labels via active learning
- Phase 2: 80/20 real/synthetic
- Phase 3: 95/5 real/synthetic, retain a small synthetic slice for adversarial/edge-case robustness
The phase progression matters. Teams that stay at 100% synthetic past Phase 0 ship models with synthetic baggage.
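The schedule can be written down as data so that each training run is sized against it. The sketch below reads the ratios as real/synthetic, which is our interpretation of the phase list; the constant and function names are hypothetical.

```python
# Target percentage of real data per phase (remainder is synthetic).
REAL_PERCENT_BY_PHASE = {0: 0, 1: 50, 2: 80, 3: 95}


def target_counts(total_examples, phase):
    """Return (real_count, synthetic_count) for a training set of a given size."""
    if phase not in REAL_PERCENT_BY_PHASE:
        raise ValueError(f"unknown phase: {phase}")
    real = total_examples * REAL_PERCENT_BY_PHASE[phase] // 100
    return real, total_examples - real
```

For a 10,000-example training set, Phase 2 calls for 8,000 real and 2,000 synthetic examples. If the real count is not yet available, that is a signal the project is not ready to leave the previous phase.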
What we ship by default
For ML engagements via our AI & LLM integration service:
- Eval set is always real and version-locked
- Synthetic data flagged in the data lineage so the team knows the proportion
- Expert audit of a synthetic sample before any training run that uses it
- Mix-in policy enforced (never 100% synthetic past prototype)
- Retraining cadence designed to migrate toward real data
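Two of these defaults lend themselves to small checks recorded with every training run: a content hash that version-locks the eval set, and the synthetic share read from lineage tags. The sketch below assumes rows are JSON-serializable dicts with a `source` field; the schema and function names are illustrative, not a description of any particular tooling.

```python
import hashlib
import json


def eval_fingerprint(eval_rows):
    """Order-independent SHA-256 over the eval rows.

    Any added, removed, or edited row changes the fingerprint.
    """
    lines = sorted(json.dumps(row, sort_keys=True) for row in eval_rows)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def synthetic_share(train_rows):
    """Fraction of training rows tagged as synthetic in the lineage."""
    if not train_rows:
        raise ValueError("training set is empty")
    synthetic = sum(1 for row in train_rows if row.get("source") == "synthetic")
    return synthetic / len(train_rows)
```

Storing the fingerprint next to each reported metric makes it visible when two results were measured against different eval sets.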
Synthetic data is a useful accelerant. It’s not a substitute for the real-world distribution.