Labeling at Scale: Why the Synthetic Approach Beats Manual Annotation

Manual labeling is a reasonable solution when your data volume is manageable and your class distribution is balanced. Both conditions fail when you need rare-class coverage. Outsourcing the work does not make a rare label cheap: the rare event has to occur before anyone can label it, and no annotation budget can fix that constraint.

The annotation economics problem

The typical annotation pipeline is: collect raw data, send it to annotators, get labels back, train a model. This works fine when positive examples are plentiful. At a 5% positive rate, you need to collect about 20 samples per positive label. At a 0.1% positive rate (roughly what you see in credit card fraud, some industrial fault detection scenarios, and rare medical conditions), you need to collect about 1,000 samples per positive label.
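The arithmetic behind those figures is simply the reciprocal of the positive rate. A minimal sketch in Python (the function name is illustrative, and the result is an expected value, not a guarantee for any particular batch):

```python
def samples_per_positive(positive_rate: float) -> int:
    """Expected number of raw samples collected per positive example."""
    if not 0 < positive_rate <= 1:
        raise ValueError("positive_rate must be in (0, 1]")
    # On average, one positive shows up every 1 / rate samples.
    return round(1 / positive_rate)


five_percent = samples_per_positive(0.05)    # 5% positive rate
tenth_percent = samples_per_positive(0.001)  # 0.1% positive rate
```

Multiplying the result by the number of positives you want gives the raw collection volume the annotation budget has to cover.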

That denominator isn't just a cost problem. It's a time problem. If you're waiting for rare events to occur naturally, your data collection timeline is governed by the event's frequency in the real world. A medical device company building a classifier for a rare failure mode that occurs in 0.05% of device-days sees, on average, one failure per 2,000 device-days. Even a minimally adequate training set for that failure class would take millions of device-days of data to accumulate.
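To make the timeline concrete, the sketch below converts a target number of positive examples into device-days and then into calendar days. The target of 1,000 failures and the fleet size of 5,000 devices are assumptions chosen for illustration, not figures from a real deployment.

```python
def device_days_needed(target_positives: int, rate_per_device_day: float) -> float:
    """Expected device-days of observation needed to see target_positives events."""
    return target_positives / rate_per_device_day


def calendar_days_needed(target_positives: int,
                         rate_per_device_day: float,
                         fleet_size: int) -> float:
    """Expected calendar days, assuming every device in the fleet reports daily."""
    return device_days_needed(target_positives, rate_per_device_day) / fleet_size


# Assumed values: 1,000 failures wanted, 0.05% failure rate, 5,000 devices.
total_device_days = device_days_needed(1_000, 0.0005)
wait_in_days = calendar_days_needed(1_000, 0.0005, 5_000)
```

Under these assumptions the wait is set by the event rate and fleet size alone; adding annotators does not shorten it.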

Active learning helps at the margins — by sampling from regions of uncertainty rather than uniformly, you can improve label efficiency. But active learning still requires the underlying rare events to occur. You're improving your sampling strategy, not changing the fundamental data scarcity.

The class imbalance compound problem

Beyond annotation cost, rare classes create a model training problem. Standard cross-entropy loss weights every sample equally, so rare-class samples contribute proportionally little to the gradient. A classifier trained on data with a 0.1% positive rate can learn to predict the negative class almost everywhere and still achieve 99.9% accuracy. The accuracy looks good, but recall on the positive class is close to zero, so the model doesn't do what you built it to do.
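The accuracy trap is easy to reproduce. The following sketch scores a classifier that always predicts the negative class on a toy label set with a 0.1% positive rate (the data is constructed for illustration, not drawn from a real dataset):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the model actually flagged."""
    preds_on_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_on_positives:
        return 0.0
    return sum(p == positive for p in preds_on_positives) / len(preds_on_positives)


# 10 positives in 10,000 samples: a 0.1% positive rate.
y_true = [1] * 10 + [0] * 9_990
# A degenerate model that predicts "negative" for everything.
y_pred = [0] * len(y_true)

majority_accuracy = accuracy(y_true, y_pred)  # 0.999
majority_recall = recall(y_true, y_pred)      # 0.0
```

This is why rare-class problems are better tracked with recall (and precision) on the positive class than with overall accuracy.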

The standard remedies are class weighting (multiply the positive-class loss by the imbalance ratio), oversampling (SMOTE or random oversampling), and undersampling (discard majority-class samples). All of these help only at the margins. Class weighting changes the loss landscape but doesn't add distributional coverage. SMOTE interpolates between existing positives, but at very low positive rates it has too few real examples to interpolate between. Undersampling throws away majority-class information.
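As an illustration of the first remedy, here is a minimal plain-Python sketch of class-weighted binary cross-entropy, with the positive-class weight set to the imbalance ratio. The function names are illustrative and not taken from any library; real training code would normally use a framework's built-in weighting option instead.

```python
import math


def imbalance_ratio(y_true):
    """Number of negatives per positive; used here as the positive-class weight."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return n_neg / n_pos


def weighted_bce(y_true, p_pred, pos_weight=1.0):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# One positive among 1,000 samples, with an uninformative model (p = 0.5).
y_true = [1] + [0] * 999
p_pred = [0.5] * 1000

weight = imbalance_ratio(y_true)               # 999 negatives per positive
plain_loss = weighted_bce(y_true, p_pred)      # positive is 1/1000 of the loss
balanced_loss = weighted_bce(y_true, p_pred, pos_weight=weight)
```

With the weight applied, the single positive contributes as much to the loss as all 999 negatives combined. It is still one positive example, though: the weight changes how much it matters, not how much of the positive class it covers.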

None of these approaches add new coverage of the rare class's distribution. They reweight or redistribute existing coverage. When you have 40 fraud examples in a dataset of 400,000 transactions, no reweighting scheme gives you adequate coverage of the fraud distribution — 40 points are simply insufficient to characterize what fraud looks like across the full feature space.

The synthetic labeling flip: labels come first

Synthetic generation inverts the annotation process. Instead of collecting raw data and then labeling it, you specify the label first and generate rows that match that label's characteristics.

In practice with Twynvex, you configure a generation job with target class = "fraud", specify the feature distributions you want the synthetic fraud rows to have (based on the real fraud examples you do have, however few), and request a target volume, say 50,000 rows. The output is a labeled dataset where every row has label = "fraud" and structurally realistic feature values for fraud transactions. No annotation step is required, because the label was part of the generation specification, not a post-hoc assignment.
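The sketch below illustrates the label-first idea in plain Python. It is not Twynvex's API or configuration format: the spec layout, the feature names, and the distributions are all hypothetical, chosen only to show that the label is an input to generation rather than an output of annotation.

```python
import random

# Hypothetical assumption: fraud skews toward the early-morning hours (0-5).
HOUR_WEIGHTS = [3] * 6 + [1] * 18

# Hypothetical generation spec: a target class plus one sampler per feature.
fraud_spec = {
    "target_class": "fraud",
    "features": {
        "amount": lambda rng: round(rng.lognormvariate(5.0, 1.0), 2),
        "hour": lambda rng: rng.choices(range(24), weights=HOUR_WEIGHTS)[0],
        "merchant_category": lambda rng: rng.choice(
            ["electronics", "gift_cards", "travel"]
        ),
    },
}


def generate_labeled_rows(spec, n_rows, seed=0):
    """Draw n_rows feature rows from the spec and attach the spec's label."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for _ in range(n_rows):
        row = {name: sample(rng) for name, sample in spec["features"].items()}
        row["label"] = spec["target_class"]  # label comes from the spec
        rows.append(row)
    return rows


synthetic_fraud = generate_labeled_rows(fraud_spec, n_rows=1_000)
```

Independent per-feature samplers like these ignore correlations between features, which a real generator would need to model; the point here is only the direction of the pipeline, from label to rows.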

This only works when you have domain knowledge to inform the generation configuration: you need to know what fraud looks like in feature space, even approximately. That knowledge is usually available. Domain experts know that certain merchant categories, time-of-day patterns, and amount ranges are characteristic of fraud, even if those combinations are rare in the training data. That knowledge goes into the generator's configuration instead of having to be present in the training data.

Comparing the two approaches on a concrete case

Consider an NLP team building an intent classifier for a customer support routing system. The classifier has 180 intent classes. In the long tail ("request refund for subscription renewal that was missed," "escalate to compliance team for regulatory complaint," and a dozen similarly specific intents), each intent has fewer than 30 real training examples. The classifier handles the high-volume intents well but systematically misroutes the low-volume ones.

Option A: manual annotation. Hire annotators, collect more support tickets, label for the rare intents. Problems: rare intents occur infrequently by definition, so collection takes months; annotators need domain expertise to correctly label the subtler rare intents; the process needs to repeat every time the intent taxonomy changes.

Option B: synthetic generation for rare intents. Given 20 to 30 real examples of each rare intent as seed data, generate 500 synthetic examples per intent that preserve the structural patterns (vocabulary, sentence length, key entity types) of the real examples. The real examples become the statistical sample from which the generator learns; the 500 synthetic examples provide adequate class coverage for training. Labels are embedded in the generation specification, so no annotation is needed.
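One rough way to check that generated text preserves the seed examples' structure is to compare simple statistics between the two sets. The sketch below compares mean token count and vocabulary overlap. The whitespace tokenizer, the function names, and the example sentences are all illustrative assumptions, and it does not cover entity types.

```python
def profile(texts):
    """Return (mean token count, vocabulary set) using whitespace tokenization."""
    token_lists = [text.lower().split() for text in texts]
    vocab = {tok for toks in token_lists for tok in toks}
    mean_len = sum(len(toks) for toks in token_lists) / len(token_lists)
    return mean_len, vocab


def compare_structure(seed_texts, synthetic_texts):
    """Compare sentence length and vocabulary between seed and synthetic text."""
    seed_len, seed_vocab = profile(seed_texts)
    syn_len, syn_vocab = profile(synthetic_texts)
    return {
        # 1.0 means the same average length as the seed examples.
        "length_ratio": syn_len / seed_len,
        # Jaccard overlap: shared words divided by all words seen.
        "vocab_jaccard": len(seed_vocab & syn_vocab) / len(seed_vocab | syn_vocab),
    }


# Made-up examples for a single rare intent.
seed = ["please refund my missed renewal", "refund the renewal charge please"]
synthetic = ["please refund the missed renewal charge"]

report = compare_structure(seed, synthetic)
```

A length ratio far from 1.0 or a very low vocabulary overlap suggests the generator has drifted from the seed examples. An overlap near 1.0 with little variation can mean it is merely copying them, so neither extreme is a good sign.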

We're not saying the synthetic approach produces better training data than large volumes of real labeled data. A dataset with 2,000 real examples per intent class would be better than 500 synthetic examples per class. The question is whether 2,000 real examples per rare intent class is achievable — and for long-tail classification problems, it often isn't within any reasonable project timeline.

Validation: how do you know synthetic labels are meaningful?

The reasonable concern: if labels come from a generation configuration rather than human annotation of real events, how do you know the synthetic positive class actually represents what you think it does?

Two checks we find useful. First, generate a held-out synthetic set (separate from training) and have domain experts review a sample. They're reviewing whether the synthetic fraud rows look like plausible fraud, not whether they're real transactions. This is a faster and cheaper review than annotating real data, and it validates that the generation configuration is capturing the right patterns. Second, train on synthetic-augmented data and evaluate on a real holdout set. If recall on the rare class improves on real positive examples, the synthetic labels correspond to real patterns that the model now generalizes from.
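The second check reduces to comparing rare-class recall on the same real holdout set before and after augmentation. A minimal sketch, with made-up labels and predictions standing in for a real holdout set and two trained models:

```python
def rare_class_recall(y_true, y_pred, positive=1):
    """Recall on the positive class: true positives / all real positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def recall_lift(y_true, baseline_pred, augmented_pred, positive=1):
    """Change in rare-class recall from the baseline to the augmented model."""
    return (rare_class_recall(y_true, augmented_pred, positive)
            - rare_class_recall(y_true, baseline_pred, positive))


# Real holdout labels (4 positives) and predictions from two hypothetical models:
# one trained on real data only, one trained on synthetic-augmented data.
holdout_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
baseline_preds = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
augmented_preds = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]

baseline_recall = rare_class_recall(holdout_labels, baseline_preds)
augmented_recall = rare_class_recall(holdout_labels, augmented_preds)
lift = recall_lift(holdout_labels, baseline_preds, augmented_preds)
```

The holdout set must contain only real examples, never synthetic ones, or the comparison measures agreement with the generator rather than with reality. The augmented model in this toy example also produces one false positive, so it is worth tracking precision alongside recall.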

Both checks involve human judgment at some point — the synthetic approach doesn't remove expert knowledge from the pipeline, it concentrates it in the generation configuration step rather than distributing it across thousands of annotation decisions. For specialized domains, that concentration often makes the overall process faster and more consistent.