
Your Eval Passed. Production Didn’t.

An accuracy score is an average, and the average is where the failure hides. One public autopsy, every number rerunnable.

Every model ships with a number that makes everyone comfortable. Ninety-five percent. A hundred percent on the held-out set. The number passes review, the system ships, and then it meets production, where the inputs do not look like the eval set, and the comfortable number quietly stops being true. The problem is not that teams skip evaluation. The problem is that an eval tells you the output was good on the distribution you tested, and says nothing about whether it survives the distribution you did not.

This is not an argument to test more. It is an argument about what a single number conceals, and it is easier to show than to assert. So here is a real one, public, that you can clone and rerun on your own machine with no model and no GPU.

The system that scored 100%

The fixture is an invoice extractor, a deliberately synthetic system built to be diagnosed in the open: text in, structured fields out. On its evaluation set it scores a clean 100% exact match. (Exact match counts an answer correct only if it equals the expected answer character for character.) Four hundred records, every field correct. By every check a team normally runs, it is done.
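For concreteness, here is a minimal sketch of exact-match scoring under that definition. The field name and function names are illustrative assumptions, not the fixture's actual code:

```python
def exact_match(predicted: dict, expected: dict) -> bool:
    """True only if every field equals the expected value character for character."""
    return predicted == expected


def exact_match_rate(pairs) -> float:
    """Fraction of (predicted, expected) pairs that match exactly."""
    pairs = list(pairs)
    return sum(exact_match(p, e) for p, e in pairs) / len(pairs)


# Hypothetical records: "10.0" is numerically the same amount as "10.00",
# but exact match compares characters, so the second pair counts as wrong.
pairs = [
    ({"total": "10.00"}, {"total": "10.00"}),
    ({"total": "10.0"}, {"total": "10.00"}),
]
rate = exact_match_rate(pairs)
print(rate)
```

The metric is strict by design: there is no partial credit, which is what makes a score of 100% look so conclusive.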

Then it meets production input. Not adversarial input, just real input: the same invoices, formatted the way the world actually formats them. Exact match drops to 86.25%. A 13.75-point fall.

Here is where most investigations stop. Eighty-six percent is still a B. You note the regression, add it to the backlog, and move on. That is the mistake, and the aggregate is what causes it.

The aggregate is the lie

The 13.75 points are not spread thinly across the inputs. Break the same production set out by the kind of formatting shift each record underwent, and the average dissolves into something far more disturbing:

  • delimiter shifts: 1.000 exact match, 187 records
  • whitespace shifts: 1.000 exact match, 158 records
  • reorder shifts: 0.000 exact match, 36 records
  • verbose shifts: 0.000 exact match, 8 records
  • compact shifts: 0.000 exact match, 6 records
  • mixed shifts: 0.000 exact match, 5 records

This is not a system that got 14% worse. It is two systems wearing one number. Where the shift preserves field order, the extractor is perfect. Where the shift moves fields around, it does not degrade, it fails completely, every single time. The 345 order-preserving records survive at 100%. The 55 field-moving records fail at 100%. Fifty-five out of four hundred is 13.75%: the entire drop, accounted for by one mechanism the average had blended into noise.
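The arithmetic is small enough to check in a few lines. This is a minimal sketch that uses only the per-slice figures listed above; the dictionary layout and the function name are illustrative assumptions, not the fixture's code:

```python
# Per-slice results as listed above: slice name -> (exact-match rate, record count).
SLICES = {
    "delimiter": (1.0, 187),
    "whitespace": (1.0, 158),
    "reorder": (0.0, 36),
    "verbose": (0.0, 8),
    "compact": (0.0, 6),
    "mixed": (0.0, 5),
}


def aggregate(slices):
    """Collapse per-slice results into (total records, correct records, overall rate)."""
    total = sum(count for _, count in slices.values())
    correct = sum(rate * count for rate, count in slices.values())
    return total, correct, correct / total


total, correct, overall = aggregate(SLICES)
failed = total - correct

print(f"records: {total}")
print(f"aggregate exact match: {overall:.4f}")
print(f"failed records: {failed:.0f} ({failed / total:.2%})")
```

The aggregate is a weighted average of ones and zeros. Nothing in the single number distinguishes that from a uniform 86% across every slice.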

A model that is uniformly 86% right and a model that is flawless on most of its inputs and totally broken on a specific 14% are different systems with different risks, and the aggregate cannot tell them apart. The slice is the whole story.

The failure that does not announce itself

It gets worse, and this is the part that should change how you read every eval you trust. Of those 55 wrong records, 49 are well-formed. They parse. They return clean, structured, plausible output. No exception. No null. Nothing a downstream null-check would ever catch. The fields are simply wrong, confidently and silently wrong.

This is the failure mode that ships, because it is invisible to exactly the instruments teams rely on. An accuracy score reports 86% and looks acceptable. An error monitor sees no errors, because nothing errored. A schema validator passes, because the shape is right. The only thing wrong is the content, and content is the one thing those guards never inspect. The system is not failing loudly in a way you will notice. It is failing quietly in a way you will not, until the wrong invoice total has already moved money.
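To make the gap concrete, here is a hedged sketch of a shape check passing on a wrong record. The field names, values, and the validator itself are hypothetical, chosen for illustration; they are not taken from the fixture:

```python
# Hypothetical schema: every field must be present and be a non-empty string.
REQUIRED_FIELDS = ("invoice_id", "issue_date", "due_date")


def is_well_formed(record) -> bool:
    """Shape check only: right keys, right types, no empty values."""
    return (
        isinstance(record, dict)
        and set(record) == set(REQUIRED_FIELDS)
        and all(isinstance(record[f], str) and record[f] for f in REQUIRED_FIELDS)
    )


expected = {
    "invoice_id": "INV-1042",
    "issue_date": "2024-03-01",
    "due_date": "2024-03-31",
}

# Hypothetical extractor output with the two dates swapped:
# every value is still a plausible date string.
predicted = {
    "invoice_id": "INV-1042",
    "issue_date": "2024-03-31",
    "due_date": "2024-03-01",
}

shape_ok = is_well_formed(predicted)
content_ok = predicted == expected
print(f"well-formed: {shape_ok}, correct: {content_ok}")
```

The validator has nothing to object to. Only a comparison against a known-correct answer, or a rule that reasons about the values themselves, can tell the two records apart.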

Verified, not asserted

You have no reason to take any of the numbers above on faith, and that is the point. The entire autopsy is a public fixture with a verification script that re-derives every figure from the raw committed data, from scratch, rather than trusting the metrics already written in the report. If a single number fails to reproduce, the script exits non-zero. It holds its own author to the same standard it holds the system under test: the script had to pass before this piece could be written.

That is the difference between a claim and a receipt. A number in a report is a claim. A number you can regenerate on your own machine is a fact. Evals tell you the output looked good. A receipt tells you whether the result held, and lets anyone check. Judge where you must. Verify where you can.

git clone https://github.com/ByteStack-Labs/agent-reliability-receipts
cd agent-reliability-receipts
python3 receipts/invoice-extraction/verify.py

It re-scores the raw data and reproduces every figure in this piece, or it fails trying.
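The pattern is simple enough to sketch. What follows is not the repository's verify.py; it is a minimal illustration of the same idea, with hypothetical function names and an in-memory stand-in for the committed raw data:

```python
import math


def recompute_exact_match(rows) -> float:
    """Re-score from raw (predicted, expected) pairs, ignoring any stored metric."""
    rows = list(rows)
    return sum(predicted == expected for predicted, expected in rows) / len(rows)


def verify(rows, reported: float, tol: float = 1e-9) -> int:
    """Return a process exit code: 0 if the reported figure reproduces, 1 if not."""
    recomputed = recompute_exact_match(rows)
    ok = math.isclose(recomputed, reported, abs_tol=tol)
    status = "OK" if ok else "MISMATCH"
    print(f"reported={reported:.4f} recomputed={recomputed:.4f} {status}")
    return 0 if ok else 1


# Synthetic stand-in for the raw data: 345 matching pairs and 55 mismatches,
# mirroring the counts in this piece. A real script would load committed files.
rows = [("a", "a")] * 345 + [("a", "b")] * 55

good = verify(rows, reported=0.8625)  # the figure reproduces
bad = verify(rows, reported=0.90)     # a figure that does not reproduce
```

A script built this way would pass the return value to sys.exit, so a CI job or a skeptical reader gets a failing exit status the moment any reported figure drifts from the data.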

Run the autopsy on your own system. agent-reliability is a free, open Claude Code plugin that reproduces the eval-to-production gap, quantifies it by slice, and surfaces any well-formed-but-wrong failures that an accuracy score hides. No engagement required.

The tool: github.com/ByteStack-Labs/claude-plugins
The proof: github.com/ByteStack-Labs/agent-reliability-receipts

If it surfaces a silent failure you would rather prove and fix before it ships, that is the work we do. Bring us the receipt and we run the full Production ML Autopsy: reproduce the failure, prove the root cause, and hand you a report where every number reruns. Book a Production ML Autopsy →

Jesse Moses is the Founder & Chief Architect of ByteStack Labs, a production-reliability firm for AI and ML systems. ByteStack Labs offers Diagnostic, Architecture & Engineering, and Advisory engagements at bytestacklabs.com.