Use Cases

Where real data runs out, Twynvex picks up.

Edge cases, rare classes, privacy-blocked datasets: three problems that synthetic generation addresses without putting real records in the output.

Use Case 01 — Fraud Detection

Class imbalance at 0.01%–0.5% positive rate

Fraud is rare by definition. A dataset with a 0.1% fraud rate contains roughly 1 fraud row per 1,000 transactions. Standard oversampling methods (SMOTE, ADASYN) degrade significantly below a 0.5% class ratio. They create synthetic samples by interpolating between existing fraud examples, so when there are only 40 fraud rows in a dataset of 40,000, the synthetic examples are largely noise.
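
To make the interpolation point concrete, here is a minimal sketch of the core SMOTE idea in plain Python. It is not the imbalanced-learn implementation; the function name, the four fraud rows, and the two columns are hypothetical. It shows that every interpolated row lies on a segment between two existing fraud rows, so nothing outside the observed examples is ever produced.

```python
import random

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

# Four hypothetical fraud rows: (amount, hour_of_day)
fraud_rows = [(900.0, 2.0), (950.0, 3.0), (40.0, 14.0), (1200.0, 23.0)]
new_rows = smote_like_samples(fraud_rows, n_new=100)

# Check that every synthetic value stays inside the range spanned by the
# real rows: interpolation cannot reach beyond what was already observed.
lo = [min(r[i] for r in fraud_rows) for i in range(2)]
hi = [max(r[i] for r in fraud_rows) for i in range(2)]
inside = all(lo[i] <= r[i] <= hi[i] for r in new_rows for i in range(2))
```

With only a handful of fraud rows, the segments between them can cut straight through regions of legitimate behavior, which is one way to read the "noise" problem described above.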

Twynvex generates schema-faithful synthetic fraud patterns from a behavioral schema — not from interpolating existing fraud rows. The generation engine preserves the joint distribution between transaction amount, merchant category, time-of-day, and the fraud label. Output is a balanced training set with a configurable class ratio and marginal distributions that match your real data's statistical shape.
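
Why the joint distribution matters can be sketched with toy data. The example below is an illustration under stated assumptions, not Twynvex's generation engine: the lognormal amounts, the 200 threshold, and the fraud probabilities are all made up. It compares data that keeps the amount/label relationship with data where each column is sampled independently, which preserves both marginals but removes the signal a classifier needs.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(42)

# Toy "real" data (hypothetical): fraud is more likely on large amounts.
amounts, labels = [], []
for _ in range(5000):
    amount = rng.lognormvariate(4.0, 1.0)
    p_fraud = 0.6 if amount > 200 else 0.02
    amounts.append(amount)
    labels.append(1 if rng.random() < p_fraud else 0)

# Sampling each column independently is equivalent to shuffling one column:
# both marginals are untouched, but the link between them is gone.
shuffled_labels = labels[:]
rng.shuffle(shuffled_labels)

joint_corr = pearson(amounts, labels)
independent_corr = pearson(amounts, shuffled_labels)
same_marginals = sorted(labels) == sorted(shuffled_labels)

print(f"joint: {joint_corr:.3f}  independent: {independent_corr:.3f}")
```

The two datasets have identical per-column statistics, yet only the first carries a usable relationship between amount and the fraud label.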

Outcome

Measurable AUC improvement on a real held-out test set. Fraud class raised from 0.08% to 15% in one generation run; the downstream XGBoost model trains without preprocessing changes.
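
As a quick piece of arithmetic, here is what those two rates mean in row counts for a 500,000-row output, the size used in the generation call in this section. The helper name is hypothetical and the calculation is plain Python, not part of the Twynvex API.

```python
def class_counts(num_rows, positive_ratio):
    """Split a row count into (positive, negative) rows for a given ratio."""
    positives = round(num_rows * positive_ratio)
    return positives, num_rows - positives

before = class_counts(500_000, 0.0008)  # 0.08% fraud -> (400, 499600)
after = class_counts(500_000, 0.15)     # 15% fraud   -> (75000, 425000)

# How many times more fraud rows the model sees at the higher ratio.
gain = after[0] / before[0]             # 187.5
```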

Input: Transaction CSV with behavioral columns + fraud label
tail_weight: 0.4 (40% of rows drawn from distribution tail regions)
label_ratio: {"fraud": 0.15} (15% positive class in output)
Output format: CSV or Parquet, same schema as input
Privacy: Zero real transaction records in output

# `schema`: the Twynvex schema for the transaction CSV (creation not shown here)
job = schema.generate(
  num_rows=500_000,
  tail_weight=0.4,
  label_ratio={"fraud": 0.15},
)

Use Case 02 — NLP Classification

Long-tail intents with fewer than 50 real examples

Enterprise chatbot classifiers and ticket routing models commonly have 200+ intent classes. The top 20 intents get thousands of training examples. The bottom 180 get fewer than 50, sometimes fewer than 10. The model learns to ignore those classes or conflate them with neighbors.

Twynvex generates synthetic text samples for under-represented intents using structural templates derived from real examples in that class. Semantic variation constraints enforce a minimum edit distance between generated samples, so the output diversifies surface phrasing without producing near-duplicate rows. Each generated sample carries its intent label, so no annotation step is needed. The JSONL output drops directly into your OpenAI fine-tuning or HuggingFace datasets pipeline.

Outcome

JSONL output compatible with OpenAI fine-tuning and the HuggingFace datasets library. All 200+ intents reach the minimum training threshold in a single generation run.

Input: JSONL intent training file with text + label columns
Target intents: Any intent class with fewer than 50 real examples
Generation mode: Structural template + semantic variation constraints
Output format: JSONL, compatible with OpenAI fine-tuning + HuggingFace datasets

# twx: alias for the Twynvex Python client (import line omitted)
schema = twx.from_jsonl("intents.jsonl")
job = schema.generate(
  label_min_count=200,  # ensure every intent has at least 200 examples
  output_format="jsonl",
)
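
The minimum-edit-distance constraint on generated samples can be sketched as a simple filter. This is an illustrative assumption about how such a constraint might work, not Twynvex's implementation: it uses character-level Levenshtein distance and a greedy pass, and the function names, example utterances, and threshold of 5 are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def filter_near_duplicates(candidates, min_distance):
    """Keep a candidate only if it is at least min_distance edits away
    from every sample already kept (greedy, order-dependent)."""
    kept = []
    for text in candidates:
        if all(levenshtein(text, k) >= min_distance for k in kept):
            kept.append(text)
    return kept

candidates = [
    "cancel my subscription",
    "cancel my subscriptions",        # 1 edit from the first: rejected
    "please stop billing my account",
    "i want to end my plan today",
]
kept = filter_near_duplicates(candidates, min_distance=5)
```

The pairwise check is quadratic in the number of kept samples, which is acceptable at a few hundred samples per intent; a larger run would likely need an approximate method.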

Use Case 03 — Healthcare Tabular

Real patient records blocked by HIPAA

Clinical ML model development stalls when data can't leave the hospital's on-premises environment. The modeling team needs patient records to train a risk stratification model, but HIPAA-covered data can't be exported to a cloud ML platform without a chain of compliance work that takes months.

Twynvex generates a synthetic patient cohort from a schema defined by the clinical data team — configurable demographic distributions, diagnosis co-occurrence rates, lab value ranges, and temporal event sequences. The schema is defined on-premises; the generation runs locally or in a private cloud environment. Output is privacy-clean by construction — no real patient record is used as input to the generation engine.

This is a privacy-by-construction approach, not a HIPAA compliance claim. Whether synthetic data satisfies your organization's specific data governance policy is a legal determination. We describe the technical architecture; your legal team makes the compliance call.

Outcome

Privacy-clean synthetic cohort shipped to cloud ML platform. Model development proceeds without waiting for compliance review of real records.

Input: Schema definition (no real records required)
Configurable: Demographic distributions, ICD-10 co-occurrence, lab ranges, temporal sequences
Output format: CSV or Parquet, HL7-style column naming compatible with common EHR exports
Privacy model: Privacy-by-construction. No real records in output. Legal determination is yours.

schema = twx.Schema()
schema.add_column("age", type="int", dist="normal", mu=52)
schema.add_column("diagnosis", type="categorical")
# cooccur_matrix: diagnosis co-occurrence rates supplied by the clinical data team
schema.add_cooccurrence("diagnosis", cooccur_matrix)
job = schema.generate(num_rows=10_000)
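
The co-occurrence input is not spelled out above, so here is a rough sketch of what schema-only sampling could look like in plain Python. Everything in it is an assumption for illustration: the ICD-10 codes, the rates, the standard deviation of 15, the age clipping, and the reading of the matrix as conditional probabilities. It does not describe how Twynvex interprets `cooccur_matrix`.

```python
import random

# Hypothetical ICD-10 codes and made-up rates, for illustration only.
base_rate = {"E11": 0.12, "I10": 0.30, "N18": 0.05}
# cooccur[a][b]: assumed P(patient has b | patient's primary diagnosis is a)
cooccur = {
    "E11": {"I10": 0.65, "N18": 0.20},
    "I10": {"E11": 0.25, "N18": 0.10},
    "N18": {"E11": 0.45, "I10": 0.80},
}

def sample_patient(rng):
    """Draw an age and a set of diagnoses from the schema alone."""
    # Normal age distribution (sigma=15 is assumed), clipped to adults.
    age = min(95, max(18, round(rng.gauss(52, 15))))
    # Pick at most one primary diagnosis from the base rates ...
    primary = None
    u = rng.random()
    for code, p in base_rate.items():
        if u < p:
            primary = code
            break
        u -= p
    diagnoses = set()
    if primary is not None:
        diagnoses.add(primary)
        # ... then add comorbidities using the conditional rates.
        for other, p in cooccur[primary].items():
            if rng.random() < p:
                diagnoses.add(other)
    return {"age": age, "diagnoses": sorted(diagnoses)}

rng = random.Random(7)
cohort = [sample_patient(rng) for _ in range(10_000)]
with_diagnosis = sum(1 for p in cohort if p["diagnoses"]) / len(cohort)
```

No patient record is read at any point: every value is drawn from the declared distributions, which is the property the privacy-by-construction argument rests on.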

What does your tail look like?