A client recently asked us to settle a debate inside their team: should they build their domain-specific classifier on top of Claude or a GPT-class model? We told them the truthful answer (it depends, and the only way to know is to test on your data), then sat with them to build the eval harness that would actually settle the question.
This note documents the rubric we used. It’s not novel. It’s just the rubric we keep ending up with after enough engagements.
The Trap
Most teams either:
- Trust the published benchmarks, pick the model with the highest MMLU score, and ship.
- Run a hand-wavy “vibes” comparison on five examples each and pick the one that “feels better.”
Both are wrong. Benchmarks measure capabilities on tasks that may have nothing to do with your domain. Vibes comparisons have no statistical power.
The Rubric
We score on three axes, weighted by what the production system actually cares about:
- Precision and recall on a labelled hold-out set of 500+ examples drawn from real production traffic
- Latency — p50 and p95 — under the actual concurrency the system will see
- Hallucination rate on a curated adversarial set of out-of-distribution inputs
For this engagement the weights were 50/30/20. Different engagements get different weights. A real-time customer-facing system might weight latency at 50%. A batch internal tool might weight it at 5%.
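To make the weighting concrete, here is a minimal sketch of what a weighted_score helper could look like. The normalisation choices (folding precision and recall into F1, scoring latency against a 2-second budget, linear penalties) are illustrative assumptions on our part, not the client's actual scoring code.

def weighted_score(precision, recall, p95_latency, hallucination_rate,
                   weights=(0.5, 0.3, 0.2), latency_budget_ms=2000):
    # Quality axis: fold precision and recall into F1 (assumption).
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Latency axis: 1.0 at zero latency, 0.0 at or beyond the budget (assumption).
    latency_score = max(0.0, 1.0 - p95_latency / latency_budget_ms)
    # Hallucination axis: 1.0 if the model never hallucinates on the OOD set.
    hallucination_score = 1.0 - hallucination_rate
    w_quality, w_latency, w_halluc = weights
    return (w_quality * f1
            + w_latency * latency_score
            + w_halluc * hallucination_score)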
The Harness
The harness itself is straightforward. Pseudo-code:
for model in candidates:
    results = []
    for example in eval_set:
        result = model.classify(example.input)
        # record() persists the row for auditability (see below) and
        # returns it so we can also score in memory.
        results.append(record(
            model=model.name,
            example_id=example.id,
            predicted=result.label,
            actual=example.label,
            latency_ms=result.latency_ms,
        ))
    score = weighted_score(
        precision=precision(results),
        recall=recall(results),
        p95_latency=p95([r["latency_ms"] for r in results]),
        hallucination_rate=oods_check(results),
        weights=(0.5, 0.3, 0.2),
    )
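None of the helpers is exotic. A minimal sketch, assuming binary labels plus an explicit out-of-distribution label on the adversarial examples; the label names ("positive", "out_of_distribution", "none") are ours, not the client's.

from statistics import quantiles

def precision(results, positive="positive"):
    # True positives over everything the model called positive.
    tp = sum(1 for r in results if r["predicted"] == positive and r["actual"] == positive)
    predicted_pos = sum(1 for r in results if r["predicted"] == positive)
    return tp / predicted_pos if predicted_pos else 0.0

def recall(results, positive="positive"):
    # True positives over everything that actually is positive.
    tp = sum(1 for r in results if r["predicted"] == positive and r["actual"] == positive)
    actual_pos = sum(1 for r in results if r["actual"] == positive)
    return tp / actual_pos if actual_pos else 0.0

def p95(latencies):
    # 95th percentile; statistics.quantiles needs at least two data points.
    return quantiles(sorted(latencies), n=100)[94]

def oods_check(results, ood_label="out_of_distribution", abstain="none"):
    # Hallucination rate: how often the model asserts a real class on inputs
    # that belong to none of the classes, instead of abstaining.
    ood_rows = [r for r in results if r["actual"] == ood_label]
    if not ood_rows:
        return 0.0
    return sum(1 for r in ood_rows if r["predicted"] != abstain) / len(ood_rows)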
What we want from a harness is repeatability (run it again next month, after model versions update, and get a comparable answer) and auditability (every prediction logged, every score reproducible).
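For auditability, the record() call persists every row as it is produced. One way to do that is an append-only JSONL log per run; the file layout here is our choice for illustration, not a requirement of the harness.

import json
import datetime
import pathlib

AUDIT_LOG = pathlib.Path("eval_runs") / f"run_{datetime.date.today().isoformat()}.jsonl"

def record(**row):
    # Append one prediction per line so the run can be re-scored later
    # without re-calling any model.
    AUDIT_LOG.parent.mkdir(parents=True, exist_ok=True)
    with AUDIT_LOG.open("a") as f:
        f.write(json.dumps(row) + "\n")
    return row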
The Result
For this specific client, on this specific task, Claude won, but by a margin small enough that we explicitly recommended they re-run the harness every quarter. Model behaviour drifts. Your production traffic drifts. The right answer last quarter is not necessarily the right answer this quarter.
The harness is the deliverable. The choice of model is just the first output it produces.