The Receipt Is the Argument

Why every claim we make ships with code you can run.

A number in a slide deck is a claim. A number you can regenerate from the raw data on your own machine is a fact. The distance between those two is the whole discipline.

Most reliability work fails at exactly this seam. The analysis is sound, the chart is clean, and none of it can be rerun. So the reader is left to trust the author. Trust is the thing we are trying to remove from the loop, not add to it.

Our rule is narrow and absolute: every figure we publish reproduces from runnable code, or it does not ship. Not the headline number, not the footnote. The verification script exits non-zero if a single value fails to reproduce, which means the author does not get a pass either. The first time we ran ours, it caught us.
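
A minimal sketch of that exit-code contract, in Python. This is not the repo's actual verify.py: the `check` and `verify` helpers, the record fields, and the toy data are all hypothetical, chosen only to show the shape of the rule. Each published figure is recomputed from raw records, every check runs, and a single mismatch turns the exit code non-zero.

```python
import math


def check(label, actual, expected, tol=1e-9):
    """Print one [OK]/[FAIL] line; return True when the value reproduces."""
    ok = math.isclose(actual, expected, rel_tol=0.0, abs_tol=tol)
    print(f"[{'OK' if ok else 'FAIL'}] {label}: {actual}")
    return ok


def verify(records, published):
    """Recompute every published figure from raw records; return an exit code."""
    exact = sum(1 for r in records if r["predicted"] == r["expected"])
    recomputed = {"exact_match": exact / len(records)}
    # Run every check (no short-circuit) so the log shows each failure.
    results = [check(name, recomputed[name], value) for name, value in published.items()]
    return 0 if all(results) else 1


# Hypothetical toy data: 8 records, 7 exact matches.
records = [{"predicted": "A-100", "expected": "A-100"}] * 7
records.append({"predicted": "A-10O", "expected": "A-100"})

exit_code = verify(records, {"exact_match": 0.875})
print(f"EXIT={exit_code}")
# A real script would finish with sys.exit(exit_code) so CI sees the failure.
```

Returning the code from `verify` rather than exiting inside it keeps the checks testable. Publishing 0.9 against the same records would print a [FAIL] line and return 1.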

    # Clone the public receipts repo
    $ git clone https://github.com/ByteStack-Labs/agent-reliability-receipts
    # Re-derive every number from the raw committed data
    $ python3 receipts/invoice-extraction/verify.py
    # 86.25% on production vs 100% on eval: a 13.75-point drop
    [OK] production exact_match: 0.8625
    # 55/400 field-moving records = exactly the error rate
    [OK] field-moving fraction == error rate: 0.1375
    # 49 wrong records with no detectable signal
    [OK] silent (well-formed-but-wrong) count: 49
    # Exits non-zero if any single number fails to reproduce
    All checks reproduced.  EXIT=0
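
The figures in that transcript are also consistent with each other, which can be checked by hand. The sketch below uses only the numbers printed above, with exact fractions to avoid floating-point noise. The last step is an assumption the transcript implies but does not state outright: that the 55 field-moving records are all of the errors in a 400-record production sample.

```python
from fractions import Fraction

production_exact_match = Fraction("0.8625")  # from the transcript
eval_exact_match = Fraction(1)               # "100% on eval"
field_moving = Fraction(55, 400)             # "55/400 field-moving records"

drop_points = (eval_exact_match - production_exact_match) * 100
error_rate = 1 - production_exact_match

assert drop_points == Fraction("13.75")      # the 13.75-point drop
assert error_rate == field_moving            # 0.1375 both ways

# Assumption: 55 errors in total, of which 49 are silent.
silent_share = Fraction(49, 55)
print(float(drop_points), float(error_rate), round(float(silent_share), 3))
# prints: 13.75 0.1375 0.891
```

Under that assumption, roughly nine in ten of the wrong records carried no detectable signal.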

This is what we mean by verified, not asserted. Evals tell you the output is good. A receipt tells you whether that result held, and lets anyone check. Judge where you must. Verify where you can.

Run the autopsy on your own system. agent-reliability is a free, open Claude Code plugin that reproduces the eval-to-production gap, quantifies it by slice, and surfaces any well-formed-but-wrong failures that an accuracy score hides. No engagement required.

The tool: github.com/ByteStack-Labs/claude-plugins
The proof: github.com/ByteStack-Labs/agent-reliability-receipts

If it surfaces a silent failure you would rather prove and fix before it ships, that is the work we do. Bring us the receipt and we run the full Production ML Autopsy: reproduce the failure, prove the root cause, and hand you a report where every number reruns.

Book a Production ML Autopsy →

Jesse Moses is the Founder & Chief Architect of ByteStack Labs, a production-reliability firm for AI and ML systems. ByteStack Labs offers Diagnostic, Architecture & Engineering, and Advisory engagements at bytestacklabs.com.