Use Cases
Where real data runs out, Twynvex picks up.
Edge cases, rare classes, and privacy-blocked datasets: three problems that synthetic generation addresses without touching real records.
Class imbalance at 0.01%–0.5% positive rate
Fraud is rare by definition. A dataset with a 0.1% fraud rate contains roughly 1 fraud row in every 1,000 transactions. Standard oversampling methods (SMOTE, ADASYN) degrade significantly below a 0.5% class ratio. They create synthetic samples by interpolating between existing fraud examples, and when there are only 40 fraud rows in a dataset of 40,000, there are too few real examples to interpolate between, so the synthetic samples are mostly noise.
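To make the interpolation point concrete, here is a minimal Python sketch of SMOTE-style sampling. It is a simplification: real SMOTE interpolates between a row and one of its k nearest minority neighbors, while this sketch picks random pairs. The feature values are made up for illustration.

```python
import random

def interpolate(a, b, t):
    """Return the point a + t * (b - a) on the segment between rows a and b."""
    return [x + t * (y - x) for x, y in zip(a, b)]

def smote_like(minority, n_new, rng):
    """Create n_new rows by interpolating between random pairs of minority rows.

    Simplification: real SMOTE restricts pairs to k nearest neighbors.
    """
    samples = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        samples.append(interpolate(a, b, rng.random()))
    return samples

# Toy minority class: (amount, hour_of_day) for three fraud rows (made-up values).
fraud_rows = [[120.0, 2.0], [950.0, 3.0], [40.0, 23.0]]
synthetic = smote_like(fraud_rows, n_new=5, rng=random.Random(0))
```

Every generated row lies on a line segment between two existing fraud rows, so the method cannot produce fraud behavior outside the span of the few examples it starts from.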
Twynvex generates schema-faithful synthetic fraud patterns from a behavioral schema — not from interpolating existing fraud rows. The generation engine preserves the joint distribution between transaction amount, merchant category, time-of-day, and the fraud label. Output is a balanced training set with a configurable class ratio and marginal distributions that match your real data's statistical shape.
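As a rough illustration of what a configurable class ratio implies in row counts, the sketch below computes how many synthetic positive rows would be needed to lift an existing dataset to a target ratio. It assumes the real rows are kept and only synthetic positives are added, which is one possible setup and not necessarily how Twynvex assembles its output; the function name is hypothetical.

```python
import math

def synthetic_positives_needed(n_total, n_pos, target_ratio):
    """Rows x to add so that (n_pos + x) / (n_total + x) >= target_ratio."""
    if not 0 < target_ratio < 1:
        raise ValueError("target_ratio must be strictly between 0 and 1")
    # Solve (n_pos + x) / (n_total + x) = r for x, then round up.
    needed = (target_ratio * n_total - n_pos) / (1 - target_ratio)
    return max(0, math.ceil(needed))

# 40 fraud rows in 40,000 transactions, raised to a 15% fraud share.
extra = synthetic_positives_needed(n_total=40_000, n_pos=40, target_ratio=0.15)
new_ratio = (40 + extra) / (40_000 + extra)
```

With these inputs, about 7,000 synthetic fraud rows are enough: the number of rows needed grows with the size of the majority class, not with the number of real fraud rows available.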
Outcome
Measurable AUC improvement on a real held-out test set. The fraud class rises from 0.08% to 15% in one generation run, and the downstream XGBoost model trains without preprocessing changes.
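# schema: a Twynvex schema object defined beforehand (not shown)
job = schema.generate(
    num_rows=500_000,             # total rows to generate
    tail_weight=0.4,
    label_ratio={"fraud": 0.15},  # fraud share of the output
)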
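The outcome for this use case is stated as AUC on a real held-out test set. As a hedged sketch of how that comparison could be checked, the function below computes ROC AUC directly from its rank definition using only the standard library. The labels and scores are made-up toy values that only exercise the function; they are not results.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy held-out labels (1 = fraud) and scores from two hypothetical models.
labels = [0, 0, 1, 0, 1]
scores_model_a = [0.10, 0.40, 0.35, 0.20, 0.80]
scores_model_b = [0.10, 0.30, 0.60, 0.20, 0.80]
auc_a = roc_auc(labels, scores_model_a)
auc_b = roc_auc(labels, scores_model_b)
```

The key design point is that both models are scored on the same real held-out rows, so any synthetic data stays on the training side of the comparison.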
Long-tail intents with fewer than 50 real examples
Enterprise chatbot classifiers and ticket routing models commonly have 200+ intent classes. The top 20 intents get thousands of training examples. The remaining 180 or more get fewer than 50, sometimes fewer than 10. The model learns to ignore those classes or conflate them with their neighbors.
Twynvex generates synthetic text samples for under-represented intents using structural templates derived from real examples in that class. Semantic variation constraints enforce a minimum edit distance between generated samples, so the output diversifies surface phrasing without producing near-duplicate rows. Each generated sample carries its intent label, so no annotation step is needed. The JSONL output drops directly into your OpenAI fine-tuning or HuggingFace datasets pipeline.
Outcome
JSONL output compatible with OpenAI fine-tuning and the HuggingFace datasets library. All 200+ intents reach the minimum training threshold in a single generation run.
schema = twx.from_jsonl("intents.jsonl")
job = schema.generate(
    label_min_count=200,  # ensure every intent has at least 200 examples
    output_format="jsonl",
)
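The minimum edit distance constraint mentioned in this use case can be illustrated with a small, self-contained Python sketch. It assumes character-level Levenshtein distance and a greedy keep-or-drop filter; how Twynvex measures distance internally is not stated in the source. The `text` and `label` field names and the intent name are hypothetical.

```python
import json

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def filter_near_duplicates(candidates, min_distance):
    """Keep a candidate only if it is min_distance or more edits from every kept one."""
    kept = []
    for text in candidates:
        if all(levenshtein(text, k) >= min_distance for k in kept):
            kept.append(text)
    return kept

candidates = [
    "cancel my subscription",
    "cancel my subscriptions",  # one edit from the first row, so it is dropped
    "please stop billing my account",
]
kept = filter_near_duplicates(candidates, min_distance=5)

# One labeled JSON object per line (field names are assumptions).
jsonl_lines = [json.dumps({"text": t, "label": "cancel_subscription"}) for t in kept]
```

A greedy filter like this is order-dependent and quadratic in the number of kept samples, which is acceptable for a few hundred samples per intent but would need indexing at larger scale.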
Real patient records blocked by HIPAA
Clinical ML model development stalls when data can't leave the hospital's on-premises environment. The modeling team needs patient records to train a risk stratification model, but HIPAA-covered data can't be exported to a cloud ML platform without a chain of compliance work that takes months.
Twynvex generates a synthetic patient cohort from a schema defined by the clinical data team — configurable demographic distributions, diagnosis co-occurrence rates, lab value ranges, and temporal event sequences. The schema is defined on-premises; the generation runs locally or in a private cloud environment. Output is privacy-clean by construction — no real patient record is used as input to the generation engine.
This is a privacy-by-construction approach, not a HIPAA compliance claim. Whether synthetic data satisfies your organization's specific data governance policy is a legal determination. We describe the technical architecture; your legal team makes the compliance call.
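To show what generating a cohort from a schema alone can look like, here is a standard-library Python sketch that samples patients from declared distributions and a co-occurrence table, with no real record as input. It does not use the Twynvex API. The diagnosis names, weights, co-occurrence probabilities, standard deviation, and age bounds are all assumptions chosen for illustration; only the mean age of 52 comes from the example on this page.

```python
import random

DIAGNOSES = ["diabetes", "hypertension", "ckd"]  # hypothetical categories
PRIMARY_WEIGHTS = [0.3, 0.5, 0.2]                # assumed primary-diagnosis mix

# COOCCUR[a][b]: assumed probability that b is also present given primary a.
COOCCUR = {
    "diabetes": {"hypertension": 0.6, "ckd": 0.2},
    "hypertension": {"diabetes": 0.3, "ckd": 0.1},
    "ckd": {"diabetes": 0.4, "hypertension": 0.7},
}

def sample_patient(rng):
    """Draw one synthetic patient from the declared distributions only."""
    # Normal age around 52 (sigma of 15 is assumed), clipped to an adult range.
    age = min(100, max(18, round(rng.gauss(52, 15))))
    primary = rng.choices(DIAGNOSES, weights=PRIMARY_WEIGHTS, k=1)[0]
    secondary = [d for d, p in COOCCUR[primary].items() if rng.random() < p]
    return {"age": age, "primary": primary, "secondary": secondary}

rng = random.Random(42)
cohort = [sample_patient(rng) for _ in range(10_000)]
```

Because every value is drawn from parameters the clinical team writes down, the only route by which real patient information could enter the output is through those parameters themselves, which is worth keeping in mind when the parameters are estimated from small real populations.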
Outcome
Privacy-clean synthetic cohort shipped to a cloud ML platform. Model development proceeds without waiting for compliance review of real records.
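schema = twx.Schema()
schema.add_column("age", type="int", dist="normal", mu=52)  # ages centered on 52
schema.add_column("diagnosis", type="categorical")
# cooccur_matrix: diagnosis co-occurrence rates from the clinical data team (not shown)
schema.add_cooccurrence("diagnosis", cooccur_matrix)
job = schema.generate(num_rows=10_000)  # 10,000 synthetic patients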