Your Loop Says Done. That Is a Claim, Not a Proof.

Loop engineering moved the work from prompting the agent to designing the system that prompts it. It also moved the risk. When the loop runs while you sleep, “done” becomes the most expensive word in the system.

For about two years, getting work out of a coding agent meant writing a good prompt and handing it enough context. That is ending, and the idea that ended it is loop engineering, named by Addy Osmani in June 2026 and pushed by Peter Steinberger and by Boris Cherny at Anthropic. The shift reduces to one sentence: you stop prompting the agent and start designing the loop that prompts it. The loop finds the work, hands it to a sub-agent, checks the result, writes down what is done, and decides the next thing. You build it once. It runs on a timer. You are no longer inside the loop. You are outside it, building it.
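The shape of such a loop fits in a few lines. What follows is a minimal sketch, not any product's API: `run_loop`, `find_work`, `run_agent`, and `check` are hypothetical names standing in for whatever your own loop uses, and the stubs at the bottom exist only to make the example run.

```python
from typing import Callable, Optional

def run_loop(
    find_work: Callable[[list], Optional[str]],
    run_agent: Callable[[str], str],
    check: Callable[[str, str], bool],
    max_iterations: int = 10,
) -> list:
    """One unattended pass: find work, delegate, check, record, repeat."""
    log = []  # the loop's written record of what it believes is done
    for _ in range(max_iterations):
        task = find_work(log)        # decide the next thing
        if task is None:
            break                    # nothing left to do
        result = run_agent(task)     # hand it to a sub-agent
        done = check(task, result)   # the checker's verdict
        log.append({"task": task, "result": result, "done": done})
    return log

# Stub components, purely for illustration.
backlog = ["fix-typo", "add-test"]

def find_work(log):
    attempted = {entry["task"] for entry in log}
    remaining = [t for t in backlog if t not in attempted]
    return remaining[0] if remaining else None

log = run_loop(
    find_work,
    run_agent=lambda task: f"patch for {task}",
    check=lambda task, result: task in result,  # a deliberately weak checker
)
for entry in log:
    print(entry["task"], "->", "done" if entry["done"] else "NOT done")
```

Every entry in that log is marked done by a checker that only looks for the task name in the output. The structure of the loop is sound; the word "done" in its log is only as strong as `check`.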

This is real, and it is probably where coding agents are going. The building blocks ship inside the products now, in Claude Code and in Codex, so a loop is no longer a pile of bash that only you understand. I am not here to argue against it. I am here to point at the one part of it that does not get cheaper as the loop gets better, and in fact gets more dangerous.

The part the loop cannot automate

Read the careful versions of the loop-engineering argument, including Osmani’s own, and they all arrive at the same wall. Verification stays on you. A loop running unattended is a loop making mistakes unattended. There are two standard answers, and they are not the same thing. One is to add a judge: a separate model that reads the output and returns a verdict, the evaluator-and-optimizer pattern in Anthropic’s own playbook. The other is to let the agent verify its own work, which is strongest when it is grounded in what Anthropic calls ground truth from the environment, real tool results and code execution rather than an opinion. Both help. Neither closes the gap, for two different reasons. A judge’s verdict is a model’s opinion about quality, probabilistic and quietly biased toward what a model produced; an agent that checks its own work authors its own tests and grades its own coverage, so it can only tell you the output passed the checks it already thought to write. When either one says pass, you have a second opinion or a self-graded test. You do not have a proof.

That is the whole problem in four words. Done is a claim. When you were in the loop, prompting line by line, a wrong step cost you a minute and you watched it happen. Now the loop runs while you sleep, and a wrong “done” does not cost a minute. It gets committed, the ticket gets closed, the next run builds on top of it, and by the time anyone looks, the mistake is three layers down and load-bearing.

The faster the loop ships code you did not write, the wider the gap between what your repo contains and what you understand is in it. I have called that gap comprehension debt; the loop-engineering writers are arriving at the same idea from the other side, noting that good loops grow it faster, not slower. Either way the conclusion is the same. Loop engineering does not remove judgment from the work. It concentrates it. It makes generation nearly free and leaves judgment as the one scarce thing, which means the entire value of a loop now hangs on a single question: when it says done, is that true, and how would you know.

A receipt is how you would know

There is a difference between a claim and a receipt, and it is the difference between a loop you hope is working and one you can prove is. A claim is a number in a report, or a checker sub-agent’s green check. A receipt is a number you can regenerate yourself, from the raw inputs, on your own machine, with a script that fails loudly the moment the number does not reproduce. One asks for your trust. The other does not need it. The separation is clean: a judge answers “is this good?”, a self-test answers “did it pass the checks I wrote?”, and a receipt answers something narrower and harder: “does this exact number reproduce from the raw data, for anyone, every time?”
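In code, a receipt can be as small as a fingerprint of the raw inputs plus the number derived from them. The sketch below is one way to do it, under assumptions: the record layout (`predicted`, `expected`), the accuracy metric, and the function names are all illustrative, not taken from any particular tool.

```python
import hashlib
import json

def fingerprint(records):
    """Stable SHA-256 of the raw inputs, so the receipt pins exactly what was measured."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def accuracy(records):
    """Re-derive the metric from raw (predicted, expected) pairs."""
    if not records:
        raise ValueError("no records to verify")
    return sum(r["predicted"] == r["expected"] for r in records) / len(records)

def make_receipt(records):
    return {"inputs_sha256": fingerprint(records), "accuracy": accuracy(records)}

def verify_receipt(records, receipt, tolerance=1e-12):
    """Raise loudly if the inputs changed or the number does not reproduce."""
    if fingerprint(records) != receipt["inputs_sha256"]:
        raise AssertionError("raw inputs differ from the ones the receipt was issued for")
    recomputed = accuracy(records)
    if abs(recomputed - receipt["accuracy"]) > tolerance:
        raise AssertionError(
            f"claimed {receipt['accuracy']:.4f}, recomputed {recomputed:.4f}"
        )
    return recomputed

# Made-up records: 3 of 4 predictions match.
raw = [
    {"predicted": "A", "expected": "A"},
    {"predicted": "B", "expected": "B"},
    {"predicted": "C", "expected": "D"},
    {"predicted": "E", "expected": "E"},
]
receipt = make_receipt(raw)
verify_receipt(raw, receipt)  # passes silently; a report claiming 1.0 would raise
```

The design choice that matters is that `verify_receipt` never reads anyone's verdict. It takes the raw records and the claimed number, and either the number falls out of the records again or the script stops.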

This matters most for the failure an unattended loop is built to miss. The dangerous failure is not the loud one. It is the output that is well-formed and wrong: it parses, it has the right shape, it throws no exception and returns no null, and it is simply incorrect. A checker sub-agent inspecting structure passes it, because the structure is fine. A test suite written against the happy path passes it. The loop marks it done and moves on, because nothing in the loop is looking at the one thing that is wrong, which is the content.
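A toy version of that failure, with invented values: the extracted record below has every field a structural check would look for, and one of them is wrong. The field names and the decimal-shift error are hypothetical, chosen only to show the gap between shape and content.

```python
def is_well_formed(output):
    """Structural check: right keys, right types. Says nothing about correctness."""
    return (
        isinstance(output, dict)
        and isinstance(output.get("invoice_id"), str)
        and isinstance(output.get("total"), float)
    )

def matches_truth(output, truth):
    """Content check: compares the values against known-correct data."""
    return output == truth

truth = {"invoice_id": "INV-001", "total": 1240.50}
extracted = {"invoice_id": "INV-001", "total": 124.05}  # hypothetical decimal shift

print("well-formed:", is_well_formed(extracted))
print("correct:", matches_truth(extracted, truth))
```

A loop that gates on `is_well_formed` marks this done. Only a check that holds the output against the raw truth sees the problem.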

I walked through a public, rerunnable example of exactly this in another piece: an extractor that scores a perfect 100% on evaluation and then, on production-realistic inputs, fails on a clean slice of cases, every one of them, with most of the wrong answers well-formed enough that no error monitor would flag them. (A slice is a group of cases that share a trait, like every invoice longer than a page.) You can clone it and reproduce every number with no model and no GPU. That is what a receipt is for. The point was never that the extractor failed. The point is that you did not have to believe me.
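Slicing is simple to compute, which is part of why a missing slice report is hard to excuse. The sketch below uses made-up cases, not the data from that example: the `pages` field, the slice names, and the counts are assumptions for illustration.

```python
from collections import defaultdict

def accuracy_by_slice(cases, slice_of):
    """Group cases by a trait and compute accuracy within each group."""
    counts = defaultdict(lambda: [0, 0])  # slice name -> [correct, total]
    for case in cases:
        bucket = counts[slice_of(case)]
        bucket[0] += int(case["predicted"] == case["expected"])
        bucket[1] += 1
    return {name: correct / total for name, (correct, total) in counts.items()}

def page_slice(case):
    """Hypothetical trait: does the invoice run longer than one page?"""
    return "multi_page" if case["pages"] > 1 else "single_page"

# Invented cases: every single-page case is right, every multi-page case is wrong.
cases = (
    [{"pages": 1, "predicted": "ok", "expected": "ok"} for _ in range(6)]
    + [{"pages": 3, "predicted": "wrong", "expected": "ok"} for _ in range(3)]
)

overall = sum(c["predicted"] == c["expected"] for c in cases) / len(cases)
by_slice = accuracy_by_slice(cases, page_slice)

print(f"overall: {overall:.2f}")
for name, acc in sorted(by_slice.items()):
    print(f"{name}: {acc:.2f}")
```

The overall figure looks like a mediocre model. The per-slice figures show a model that is perfect on one kind of input and fails on every case of another, which is a different diagnosis and a different fix.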

The checker your loop is actually missing

Your loop has a maker. It probably has a checker. What it does not have is a checker whose pass you can trust without trusting it, and that is a specific, buildable thing: a verification step that does not grade the output but re-derives the result from the raw data and exits non-zero the moment a number fails to hold.
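As a rough sketch of that step, and not the plugin's actual implementation: the gate below takes the raw records and the numbers a run reported, re-derives each number, and returns a process exit code. The metric names (`n_cases`, `n_correct`) and the record layout are hypothetical.

```python
import sys

def recompute(raw):
    """Re-derive every reported number from the raw records."""
    return {
        "n_cases": len(raw),
        "n_correct": sum(r["predicted"] == r["expected"] for r in raw),
    }

def gate(raw, reported):
    """Return an exit code: 0 if every reported number holds, 1 otherwise."""
    actual = recompute(raw)
    mismatches = {
        name: (claimed, actual.get(name))
        for name, claimed in reported.items()
        if actual.get(name) != claimed
    }
    for name, (claimed, got) in mismatches.items():
        print(f"FAIL {name}: reported {claimed}, re-derived {got}", file=sys.stderr)
    return 1 if mismatches else 0

# Made-up run: the report claims 3 of 3, the raw data says 2 of 3.
raw = [
    {"predicted": "A", "expected": "A"},
    {"predicted": "B", "expected": "B"},
    {"predicted": "C", "expected": "D"},
]
honest_code = gate(raw, {"n_cases": 3, "n_correct": 2})
inflated_code = gate(raw, {"n_cases": 3, "n_correct": 3})
print("honest report ->", honest_code, "| inflated report ->", inflated_code)
```

In a standalone script the last line would be `sys.exit(gate(raw, reported))`, so that whatever runs the loop can branch on the exit status instead of parsing a verdict. The gate compares integer counts rather than rounded percentages, which keeps the comparison exact.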

That is what agent-reliability is. It is a free, open Claude Code plugin, and the cleanest way to place it is as the verifier half of your loop, the one that emits a receipt instead of an opinion. Today that is a verification you run on what your loop produces: point it at the system your loop touches and it runs the autopsy, reporting where the eval distribution stops matching production, which slice carries the failure, and how much of your error is silent rather than loud. Wiring it inside the loop as a gate that fires on every iteration is the direction, not yet the claim. Either way it hands you something you can rerun, which is the only kind of “done” that survives contact with a loop running while you are asleep.

Be clear about what this is not. It does not replace your judge or your self-test, and it does not promise to catch every failure. It is the layer underneath them: the way you would check that your judge is calibrated, or that your agent’s “done” is true. The judgment of where to look, which slice, which metric, which input counts as production, stays yours. The receipt does not remove that judgment; it makes it reproducible instead of asserted, which is the one thing a verdict and a self-test structurally cannot do.

Loop engineering is right that the leverage has moved. It moved to the loop. But the value moved somewhere more specific: to the one step in the loop you cannot take on faith. Build the loop like someone who intends to stay the engineer, and the verification step is where you stay.

Run the autopsy on your own system. agent-reliability is a free, open Claude Code plugin that reproduces the eval-to-production gap, quantifies it by slice, and surfaces well-formed-but-wrong failures, where they exist, that an accuracy score hides. No engagement required.

The tool: github.com/ByteStack-Labs/claude-plugins
The proof: github.com/ByteStack-Labs/agent-reliability-receipts

If it surfaces a silent failure you would rather prove and fix before it ships, that is the work we do. Bring us the receipt and we run the full Production ML Autopsy: reproduce the failure, prove the root cause, and hand you a report where every number reruns.

Jesse Moses is the Founder & Chief Architect of ByteStack Labs, a production-reliability firm for AI and ML systems. ByteStack Labs offers Diagnostic, Architecture & Engineering, and Advisory engagements at bytestacklabs.com.