FIELD NOTES · AI RELIABILITY

Passing evals isn’t surviving production.

Writing on the gap between the eval that passes and the system that fails under real load, and on why, as the work shifts from building to orchestrating, verification is the one step you cannot hand to the loop. Every claim ships with a receipt you can clone and rerun.


invoice-extraction · REPORT.md

exact match per shift_type — the slice is the whole story

shift_type    EM      records   group
delimiter     1.000   187       order-preserving · survives
whitespace    1.000   158       order-preserving · survives
reorder       0.000    36       field-moving · silently wrong
verbose       0.000     8       field-moving · silently wrong
compact       0.000     6       field-moving · silently wrong
mixed         0.000     5       field-moving · silently wrong

$ git clone github.com/ByteStack-Labs/agent-reliability-receipts
  Clone the public receipts repo

$ python3 receipts/invoice-extraction/verify.py
  Re-derive every number from the raw committed data

[OK] production exact_match: 0.8625
  86.25% on production vs 100% on eval: a 13.75-point drop

[OK] field-moving fraction == error rate: 0.1375
  55/400 field-moving records = exactly the error rate

[OK] silent (well-formed-but-wrong) count: 49
  49 wrong records with no detectable signal

All checks reproduced. EXIT=0
  Exits non-zero if any single number fails to reproduce

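The first two checks in the transcript above can be approximated from the per-slice numbers in the report alone. What follows is a minimal sketch, not the repo's actual verify.py: the names SLICES, FIELD_MOVING, aggregate, and field_moving_fraction are hypothetical, and it assumes the six slices make up the full 400-record production set, which the totals are consistent with. It works from the published slice counts rather than the raw committed data, and uses exact fractions so that the equality check is not at the mercy of floating-point rounding.

```python
from fractions import Fraction

# Per-slice results copied from the REPORT.md panel:
# shift_type -> (exact_match, record_count)
SLICES = {
    "delimiter":  (Fraction(1), 187),
    "whitespace": (Fraction(1), 158),
    "reorder":    (Fraction(0), 36),
    "verbose":    (Fraction(0), 8),
    "compact":    (Fraction(0), 6),
    "mixed":      (Fraction(0), 5),
}

# Assumption: these four slices are the "field-moving" group in the legend.
FIELD_MOVING = {"reorder", "verbose", "compact", "mixed"}


def aggregate(slices):
    """Return (total records, overall exact match) as exact values."""
    total = sum(n for _, n in slices.values())
    correct = sum(em * n for em, n in slices.values())
    return total, correct / total


def field_moving_fraction(slices, field_moving):
    """Return the share of records that belong to field-moving slices."""
    total = sum(n for _, n in slices.values())
    moved = sum(n for name, (_, n) in slices.items() if name in field_moving)
    return Fraction(moved, total)


total, overall_em = aggregate(SLICES)
error_rate = 1 - overall_em
moved = field_moving_fraction(SLICES, FIELD_MOVING)

print(f"records: {total}")
print(f"production exact_match: {float(overall_em):.4f}")
print(f"field-moving fraction: {float(moved):.4f}")
print(f"field-moving fraction == error rate: {moved == error_rate}")
```

The silent (well-formed-but-wrong) count of 49 cannot be re-derived this way, because it depends on inspecting individual records rather than slice totals.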

Latest


Jun 24, 2026·5 min

Your Loop Says Done. That Is a Claim, Not a Proof.

Loop engineering moved the work from prompting the agent to designing the system that prompts it. It also moved the risk. When the loop runs while you sleep, “done” becomes the most expensive word in the system.


Jun 24, 2026·5 min

Your Eval Passed. Production Didn’t.

An accuracy score is an average, and the average is where the failure hides. One public autopsy, every number rerunnable.


Jun 19, 2026·3 min

The Receipt Is the Argument

Why every claim we make ships with code you can run.


Mar 27, 2026·8 min

Comprehension Debt

The silent cost of AI-assisted engineering.
