Does AI Agree With Itself? A Self-Consistency Experiment

TL;DR: I built a 1,400-case classification benchmark with labels generated by Claude Opus 4.5. When I asked Opus to re-classify its own specs, it agreed with itself 99.86% of the time. Other models—GPT-5-mini, Grok, Gemini, Haiku—hit a ceiling around 78%. That 20-point gap isn’t model failure. It’s the space where different systems learned different (but valid) classification patterns. Ground truth isn’t truth—it’s one model’s opinion.


It was really windy last night. Not bitter cold—mid-30s—but the sound of the wind rattling the windows made it feel colder. Perfect excuse to fire up those 2,400W space heaters made by NVIDIA that sit in the basement underneath my family room. The rack is positioned strategically now: three CPUs, five GPUs, and a NAS array all pumping heat up through the floorboards. My wife gets a warm family room; I get an excuse to run experiments.

Tonight’s question has been nagging at me for weeks: If an AI generates labeled data, can it reproduce its own labels?

The Setup

I’ve been running a benchmark evaluating LLMs on a domain-specific classification task: categorizing commercial construction specifications into standardized industry codes. This work supports teraContext.AI’s commercial construction expert system. 1,400 specs across 35 categories—concrete, HVAC, plumbing, industrial equipment, the whole taxonomy.

The ground truth labels were generated by Claude Opus 4.5 using a simple prompt. Each spec was created to fit a specific category, so in theory, the “correct” answer is known.

Then I ran six models against it:

| Model | Accuracy | Avg Response Time |
| --- | --- | --- |
| GPT-5-mini | 79.7% | 13.6s |
| gpt-oss:120b (local, open-weight) | 78.0% | 2.2s |
| Grok 4.1 | 77.6% | 6.7s |
| Gemini 3 Flash | 76.8% | 1.6s |
| Claude Haiku 4.5 | 76.0% | 2.8s |
| Devstral-2512 | 75.2% | 2.6s |

Every model clustered between 75-80%. Different companies, different architectures, different training data—yet they all hit the same ceiling.
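
For the mechanics: each classification is a single chat completion through OpenRouter's OpenAI-compatible endpoint. Here's a rough sketch of what one call looks like; the category list, prompt wording, and model slug are stripped-down placeholders, not the enhanced prompt or exact harness the benchmark actually used.

```python
import os
import requests

# A stripped-down category list; the real benchmark uses all 35.
CATEGORIES = ["Concrete", "HVAC", "Plumbing", "Industrial Equipment"]

def classify_spec(spec_text: str, model: str) -> str:
    """Ask one model, via OpenRouter, which category a spec belongs to."""
    prompt = (
        "Classify the following construction specification into exactly one of "
        f"these categories: {', '.join(CATEGORIES)}. "
        "Answer with the category name only.\n\n"
        f"Specification:\n{spec_text}"
    )
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,  # an OpenRouter model slug, e.g. for GPT-5-mini
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()
```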

Why couldn’t any model break 80%?

The Uncomfortable Question

Here’s what made me pause: in 84 cases, all six models unanimously agreed on an answer—and that answer disagreed with the ground truth.

Six independent systems. Different training. Same conclusion. All marked “wrong.”

At some point you have to ask: what if the ground truth is the problem?
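
Surfacing those cases is a one-liner once the predictions sit in a table. A minimal sketch, assuming the results live in a pandas DataFrame with one prediction column per model plus a ground_truth column; the column names are mine, not the harness's.

```python
import pandas as pd

# df has one row per spec: a 'ground_truth' column plus one prediction
# column per model. Column names are illustrative placeholders.
MODEL_COLS = ["gpt5_mini", "gpt_oss_120b", "grok", "gemini", "haiku", "devstral"]

def unanimous_disagreements(df: pd.DataFrame) -> pd.DataFrame:
    """Rows where all six models give the same answer and it differs from ground truth."""
    all_agree = df[MODEL_COLS].nunique(axis=1) == 1          # one unique label per row
    differs = df[MODEL_COLS[0]] != df["ground_truth"]        # that label rejects the ground truth
    return df[all_agree & differs]
```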

Testing Self-Consistency

So I ran the obvious experiment. Opus 4.5 created these labels. Can Opus 4.5 reproduce them?

Running Opus through the OpenRouter API would cost a fortune for 1,400 classifications. But I use Claude Code for development, and Claude Code runs on Opus 4.5. So I just asked it to classify the specs directly during our session—same enhanced prompt given to other models, no API costs.

The results:

| Metric | Value |
| --- | --- |
| Total Specs | 1,400 |
| Self-Agreements | 1,398 |
| Self-Consistency | 99.86% |

Two disagreements. Out of 1,400.

The Two Edge Cases

Both disagreements were genuine ambiguities:

Spec 111 — Radon mitigation system: Originally labeled as HVAC (it involves ventilation). On reflection, equally valid as environmental remediation (it addresses a hazardous contaminant). The spec text mentions “sub-slab depressurization,” which could go either way.

Spec 1083 — Distributed control system: Originally labeled as building automation. On reflection, equally valid as industrial process control. The spec mentions “redundant controllers and I/O modules”—that’s building automation language, but “operator workstations” suggests industrial scale.

Neither classification is wrong. Both are defensible. The ambiguity is inherent to the domain, not a labeling error.

What This Means

The Benchmark Is Valid

If Opus couldn’t reproduce its own labels, the benchmark would have a consistency problem. But 99.86% self-agreement proves the opposite: the labels are coherent and reproducible. The ceiling other models hit isn’t benchmark noise.

Other Models Think Differently

When GPT-5-mini scores 79.7%, it’s not making random errors on 20% of the data. It’s making systematic classification choices that differ from how Opus categorizes things.

The inter-model agreement patterns prove this:

| Model Pair | Agreement |
| --- | --- |
| GPT-5-mini + Grok | 87.4% |
| Grok + Gemini | 87.3% |
| Gemini + GPT-5-mini | 83.4% |

GPT-5-mini and Grok agree with each other 87% of the time—but both only agree with Opus ground truth about 78% of the time. They’ve learned similar classification patterns to each other, but different patterns from Opus.
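
Pairwise agreement is just the fraction of specs on which two models emit the same label. A sketch, reusing the hypothetical DataFrame and MODEL_COLS from the earlier snippet:

```python
from itertools import combinations

def pairwise_agreement(df: pd.DataFrame, cols=MODEL_COLS) -> dict:
    """Fraction of specs on which each pair of models emits the same label."""
    return {(a, b): float((df[a] == df[b]).mean()) for a, b in combinations(cols, 2)}
```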

Agreement Is a Signal

When all six models agree on an answer that differs from ground truth, that’s not random noise. I found 84 such cases. Either:

  1. All six models made the same mistake (unlikely—they’re independent systems)
  2. The ground truth is the outlier

Option 2 seems more plausible. If we exclude those 84 unanimous-disagreement cases, adjusted accuracy jumps:

| Model | Raw | Adjusted |
| --- | --- | --- |
| GPT-5-mini | 79.7% | ~84.6% |
| gpt-oss:120b | 78.0% | ~83.0% |
| Grok 4.1 | 77.6% | ~82.5% |
| Gemini 3 Flash | 76.8% | ~81.7% |

The 78% ceiling wasn’t a performance limit. It was label noise.
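
The adjustment itself is simple: drop the unanimous-disagreement rows and recompute accuracy on what remains. A sketch, again against the hypothetical DataFrame and the unanimous_disagreements helper from the earlier snippet:

```python
def adjusted_accuracy(df: pd.DataFrame, model_col: str) -> float:
    """Accuracy after excluding specs where all six models unanimously reject the label."""
    keep = df.drop(unanimous_disagreements(df).index)
    return float((keep[model_col] == keep["ground_truth"]).mean())
```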

Who Wins When Models Disagree?

This is where GPT-5-mini earned its top spot. It doesn’t just score highest—it wins head-to-head disagreements against every other model:

| When GPT-5-mini Disagrees With | GPT-5-mini Wins | Other Wins |
| --- | --- | --- |
| Grok | 89 | 59 |
| Gemini | 117 | 76 |
| Haiku | 134 | 82 |
| Devstral | 145 | 82 |
| gpt-oss:120b | 135 | 111 |

When GPT-5-mini disagrees with any other model, it’s correct more often than not. It’s making better judgment calls on the ambiguous cases.

gpt-oss:120b shows a similar pattern against the smaller cloud models—it wins its disagreements against Grok, Gemini, Haiku, and Devstral. It only loses to GPT-5-mini. The “different thinker” isn’t wrong more often—it’s right in different ways.
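
The head-to-head tallies only look at specs where the two models disagree, and credit whichever one matches the ground-truth label. A sketch using the same hypothetical DataFrame:

```python
def head_to_head(df: pd.DataFrame, model_a: str, model_b: str) -> tuple[int, int]:
    """On specs where two models disagree, count how often each matches ground truth."""
    disputed = df[df[model_a] != df[model_b]]
    a_wins = int((disputed[model_a] == disputed["ground_truth"]).sum())
    b_wins = int((disputed[model_b] == disputed["ground_truth"]).sum())
    return a_wins, b_wins
```

Note that the two win counts can sum to less than the number of disagreements: on some disputed specs, neither model matches the label.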

The Philosophy of Ground Truth

Here’s the uncomfortable insight that keeps surfacing in these benchmarks: ground truth is just another model’s opinion.

Our “correct” labels came from Opus generating each spec to fit a category it was given. That’s not objective truth—it’s one system’s interpretation of a classification schema. A thoughtful, consistent interpretation (99.86% reproducible!), but an interpretation nonetheless.

When we treat ground truth as infallible, we make a category error. Labels are hypotheses, not facts. They’re the starting point for evaluation, not the final word.

The 99.86% self-consistency proves Opus has a coherent internal model for construction classification. But that doesn’t make its classifications universally “correct”—it makes them one valid way to categorize this domain.

Other models learned different-but-valid patterns. The gap between 78% and 99.86% isn’t error. It’s the space where reasonable systems can reasonably disagree.

What This Means for Your Business

For benchmark designers:

  • Test your labeler’s self-consistency. If they can’t reproduce their own labels, your benchmark has problems.
  • Unanimous cross-model disagreement with ground truth isn’t noise—investigate it.
  • Report adjusted metrics alongside raw ones.

For practitioners:

  • A model scoring 78% might actually be performing at 83%+ against “true” labels.
  • High inter-model agreement on “wrong” answers suggests label issues, not model failures.
  • The model that thinks differently isn’t necessarily worse—test whether it wins its disagreements.

For everyone evaluating AI:

  • Ground truth is a model, not an oracle.
  • Self-consistency is measurable and meaningful.
  • When multiple independent systems converge on an answer your labels reject, consider that they might be right.

The Meta-Lesson

I set out to benchmark LLM classification performance. I ended up learning something more fundamental: that “correct” in classification tasks is surprisingly slippery.

Opus agrees with itself 99.86% of the time. Other models agree with Opus about 78% of the time. They agree with each other about 82% of the time.

The gap between 78% and 99.86% isn’t model failure. It’s the space where reasonable systems learned different classification heuristics from their training data. None of them are wrong. They’re just different.

Next time you see a model “underperforming” on a benchmark, ask yourself: what if the benchmark is the outlier? Sometimes the models aren’t wrong. They’re outvoting the ground truth. And maybe that’s worth listening to.


What It Cost

For the budget-conscious: here’s what 1,400 classifications cost on OpenRouter for the frontier models:

| Model | Total Cost |
| --- | --- |
| GPT-5-mini | $3.62 |
| Claude Haiku 4.5 | $2.68 |
| Gemini 3 Flash | $1.11 |
| Grok 4.1 Fast | $0.73 |
| Total | $8.14 |

The entire frontier model benchmark cost under $10. Gemini 3 Flash offers the best bang-for-buck: 76.8% accuracy at $0.0008 per classification. GPT-5-mini costs 3x more but only gains 3 percentage points.

For comparison: gpt-oss:120b running locally on an RTX 6000 Blackwell Pro QMax costs about $0.075/hour in electricity at 500W system draw. But electricity isn’t the whole story—factor in depreciation and overhead on a $10k workstation and you’re looking at roughly $2/hour total cost. The 1,400-case benchmark took about 52 minutes, so maybe $1.75 total. That undercuts GPT-5-mini and Haiku (though not Gemini or Grok), and there’s a real cost to controlling your own destiny by owning hardware instead of renting by the token.
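
For anyone redoing the arithmetic, here is the back-of-the-envelope version. The $0.15/kWh electricity rate is the one implied by $0.075/hour at 500W, and the $2/hour all-in figure is the rough estimate quoted above; both are assumptions, not measured costs.

```python
# Back-of-the-envelope cost math for the local gpt-oss:120b run.
SYSTEM_DRAW_KW = 0.5          # 500 W system draw
ELECTRICITY_PER_KWH = 0.15    # assumed rate -> 0.5 kW * $0.15 = $0.075/hour
ALL_IN_PER_HOUR = 2.00        # electricity + depreciation + overhead (rough estimate)
RUN_HOURS = 52 / 60           # 1,400 classifications in ~52 minutes

electricity_cost = SYSTEM_DRAW_KW * ELECTRICITY_PER_KWH * RUN_HOURS  # ~$0.07
all_in_cost = ALL_IN_PER_HOUR * RUN_HOURS                            # ~$1.73
per_classification = all_in_cost / 1400                              # ~$0.0012

print(f"electricity: ${electricity_cost:.2f}, all-in: ${all_in_cost:.2f}, "
      f"per spec: ${per_classification:.4f}")
```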

Worth noting: gpt-oss:120b is the largest open-weight model tested—the frontier models (GPT-5-mini, Grok, Gemini, Claude) don’t publicly disclose their parameter counts, though estimates range from 600B to 1.8 trillion. Yet a 120B model running on a single GPU matched them on accuracy and beat most of them on speed. For domain-specific classification tasks, you may not need the frontier.

There’s another advantage to open-weight models: you can train them. A few hundred examples of domain-specific classifications, some fine-tuning, and that 78% accuracy could climb significantly higher. The frontier models are black boxes—you get what you get. Open-weight models are starting points you can customize for your domain.

And the Opus self-agreement test? $0.00. When I first considered running Opus through OpenRouter, the cost would have been significant—Opus 4.5 isn’t cheap. But then it hit me: Claude Code runs on Opus 4.5. Why pay for API calls when I can just ask it directly?

So I asked, and it did. All 1,400 classifications happened inline during a Claude Code session, covered by the monthly Claude Max subscription. The AI figured out how to avoid paying for itself.


Benchmark: 1,400 construction specifications across 35 industry-standard categories. Models tested via OpenRouter: GPT-5-mini, Grok 4.1, Gemini 3 Flash, Claude Haiku 4.5, Devstral-2512. Local model: gpt-oss:120b via Ollama. Self-agreement test performed by Claude Opus 4.5 via Claude Code. December 2025.

Appendix: About That Self-Test

A note on methodology: the “self-agreement test” was performed by Claude Opus 4.5 classifying all 1,400 specs during a Claude Code session. No API calls, no token costs—just direct classification using the same enhanced prompt given to other models.

This is both the strength and the limitation of the experiment. The agreement was between two different tasks: first, being given a category and asked to generate an example specification; second, being given that specification and asked what category it best matches. The 99.86% agreement suggests Opus has a highly stable internal model of these classification categories—it can round-trip from category to example and back. But it’s still the same underlying model, so some consistency is expected.

The interesting question isn’t whether Opus agrees with itself (it does). It’s whether that agreement is higher than inter-model agreement (it is—dramatically). That’s the signal worth paying attention to.