FIELD NOTES · AI RELIABILITY

Passing evals isn’t surviving production.

Writing on the gap between the eval that passes and the system that fails under real load, and on why, as the work shifts from building to orchestrating, verification is the one step you cannot hand to the loop. Every claim ships with a receipt you can clone and rerun.

Read the writing
invoice-extraction · REPORT.md
exact match per shift_type — the slice is the whole story
delimiter
EM 1.000 · 187 recs
whitespace
EM 1.000 · 158 recs
reorder
EM 0.000 · 36 recs
verbose
EM 0.000 · 8 recs
compact
EM 0.000 · 6 recs
mixed
EM 0.000 · 5 recs
order-preserving · survivesfield-moving · silently wrong
1$ git clone github.com/ByteStack-Labs/agent-reliability-receiptsClone the public receipts repo
2$ python3 receipts/invoice-extraction/verify.pyRe-derive every number from the raw committed data
3[OK] production exact_match: 0.862586.25% on production vs 100% on eval: a 13.75-point drop
4[OK] field-moving fraction == error rate: 0.137555/400 field-moving records = exactly the error rate
5[OK] silent (well-formed-but-wrong) count: 4949 wrong records with no detectable signal
6All checks reproduced.  EXIT=0Exits non-zero if any single number fails to reproduce
7$ _