Verification Is the Scarce Act

A few weeks ago I started noticing the same shape in three unrelated places, and once I saw it I couldn't unsee it.

A test suite ran fully green, and the code underneath it was broken in two ways that would have reached production. A research paper described agents that grade their own work and quietly store the confident-but-wrong answers to reuse later. And in my own week, a tool I'd "fixed" was still sitting on the cause that broke it — I'd just made the pain stop. Three different domains, one underlying truth: making something and knowing it's correct are not the same act, and we keep confusing the two because for most of history they came bundled together.

Generation got cheap, so the bottleneck moved

For a long time, the hard part of building anything was the building. Writing the code, drafting the document, producing the analysis — that was where the hours went, so that was where we put our attention and our pride. Verification rode along for free, because if you'd spent four hours hand-writing something, you'd also spent four hours understanding it. The making was the checking.

That bundle has come apart. Producing a first draft of almost anything — code, prose, a plan, a test — is now close to free and close to instant. And the moment generation gets cheap, the scarce thing is no longer "can you make it." It's "can you trust that it's right." The cost didn't disappear; it relocated. It moved from the keyboard to the judgment.

I'd been operating on this assumption for a while before I saw research point the same way. It was reassuring rather than surprising — the kind of confirmation that tells you the thing you'd been building toward was a real edge and not a private quirk. The signal landed almost the day after I'd committed to it in my own work, which is mostly a lesson about how fast the ground is moving right now.

The producer is never a clean judge

Here's the part that took me longer to accept, because it's uncomfortable.

The instinct, when you realize verification is the hard part, is to throw a smarter checker at it. If the model wrote the code, have a model review the code. If I wrote the tests, I'll read them carefully before I trust them. More intelligence, pointed at the output, should catch more problems.

It doesn't work as well as it should, and the reason is structural. Whoever produced the work carries a set of blind spots — the assumptions they never questioned because questioning them is what would have prevented the bug in the first place. When the same mind, or a mind shaped the same way, grades that work, it brings the same blind spots to the grading. It can't see what it never thought of. My green tests blessed broken code because I'd written those tests with the exact assumptions that produced the bug. An agent grading itself stores its confident mistakes as facts. I called my own bandaid a fix because the crash had stopped and my brain wanted to be done.
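A minimal, hypothetical sketch of that shared blind spot (not the actual code from my week): the function `average` and its test are both written under the assumption that the input is never empty, so the test goes green while the bug stays invisible. The names `average`, `producer_test`, and `independent_check` are invented for illustration.

```python
def average(values):
    # Hidden assumption: values is never empty.
    return sum(values) / len(values)


def producer_test():
    # Written with the same assumption as the code:
    # it only tries the kind of input the author imagined.
    return average([2, 4, 6]) == 4


def independent_check():
    # Condition stated without reading the implementation:
    # "any list, including an empty one, must not crash."
    try:
        average([])
        return True
    except ZeroDivisionError:
        return False


print("producer test passes:", producer_test())
print("independent check passes:", independent_check())
```

The producer's test is not wrong; it just cannot fail in the one place the code is broken, because both were shaped by the same unasked question.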

The producer is not a neutral judge of the product. Not because anyone is careless — because the angle is wrong. You cannot stand outside your own perspective by trying harder from inside it.

The strongest gate is the one that can't be talked into agreeing

So if a smarter, more agreeable judge isn't the answer, what is?

The move I keep coming back to is the opposite of intelligence: a dumb, mechanical gate that runs before any judgment happens. A shell command that exits zero or non-zero. A test that passes or fails. A schema that validates or rejects. Something with no opinion, no desire to please you, and no shared blind spot — because it has no spots at all. It doesn't reason about whether your work is good; it checks one concrete, pre-stated condition and reports a binary truth you can't argue with.
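As a rough sketch of what such a gate could look like, assuming the check can be expressed as a command: the hypothetical helper `gate` below runs a command and reports only whether it exited zero. The two stand-in commands are placeholders; a real gate would call a test runner or a schema validator instead.

```python
import subprocess
import sys


def gate(command):
    """Run a command and report only whether it exited zero."""
    result = subprocess.run(command, capture_output=True)
    return result.returncode == 0


# Stand-in checks that simply exit with a fixed status code.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print("passing check:", gate(passing))
print("failing check:", gate(failing))
```

The helper deliberately throws away the command's output. The verdict is the exit code and nothing else, which leaves no room to read a failure generously.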

This sounds like a step down, and that's exactly why it works. A judge can be persuaded. A judge that you built, or that thinks like you, is especially easy to persuade, because it's already inclined to see what you see. A gate can't be persuaded, because it isn't thinking. You decide the condition up front, while you're still honest — before you're tired, before you're invested in being done — and then you hand the verdict to something that will return the same answer whether you like it or not. The discipline is front-loaded into defining the check, where your judgment is still clean, instead of back-loaded into a review, where it's already compromised by everything you've just built.

The richer judgment — the human eye, the model review, the taste — still matters. But it goes after the gate, not instead of it, and it spends its attention only on the things a binary check can't reach. You don't ask a person to verify what a passing test already proved. You ask them for the part no test can hold: whether the thing is actually good.
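One way to sketch that ordering, under the assumption that each gate is a plain true-or-false function: the hypothetical `ship` function below runs every mechanical gate first and only calls the reviewer if all of them pass. The gate names and the `review` stand-in are invented for illustration.

```python
def ship(work, gates, review):
    # Mechanical gates run first; each is a (name, check) pair.
    for name, check in gates:
        if not check(work):
            return f"rejected by gate: {name}"
    # Judgment is spent only on work that has already cleared every gate.
    if not review(work):
        return "rejected by review"
    return "shipped"


review_calls = []


def review(work):
    # Stand-in for a human or model review; records that it was asked.
    review_calls.append(work["title"])
    return True


gates = [
    ("has_title", lambda w: bool(w.get("title"))),
    ("body_not_empty", lambda w: len(w.get("body", "")) > 0),
]

first = ship({"title": "Draft", "body": ""}, gates, review)
second = ship({"title": "Draft", "body": "text"}, gates, review)
print(first)
print(second)
print("reviews requested:", len(review_calls))
```

The reviewer is never asked about the first draft, because a gate already rejected it. Its attention goes only to the draft that no binary check could fault.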

The thread under all of this isn't really about engineering. It's an old idea wearing new clothes. The hardest thing to see clearly has always been yourself, and the oldest trick the mind plays is translating relief into resolution — "it stopped hurting" into "it's cured," "it looks done" into "it's correct." For most of history we got away with it because the work was slow enough that doing it and checking it were the same motion. Now that making is free, the gap between "made" and "verified" is wide open, and the only honest way across it is to build something that won't agree with you just because you want it to.

So the question I've started asking, before I let anything through: not "does this look right?" — I'm the worst person to answer that — but "what's the dumb check that would catch me if I'm wrong, and have I actually run it?" When did you last ship something because it passed, without asking who, or what, was doing the passing?

This is part of my ongoing exploration of what happens when you treat your life as a system worth engineering and a question worth examining.
