verification-before-completion: stop claiming work is done when it isn't
Published 2026-05-11 by Owner
The most common failure pattern in AI-assisted coding is not bad code. It’s false completion signals. The agent writes something, says “implementation complete”, and the user runs it to find it doesn’t work. The agent apologizes, fixes the obvious issue, says “that should do it now”, and the cycle repeats. Every iteration costs more time than if the agent had just verified before claiming done.
This is the problem the verification-before-completion skill in Claude Code’s Superpowers suite is built to solve. The rule it enforces is blunt: evidence before assertions. If the agent claims tests pass, it must have run the tests and pasted the output. If it claims lint is clean, the lint output has to be there. If it says “fixed”, it must have reproduced the original failure and shown that it no longer happens.
No exceptions for “this looks right to me.”
The skill is part of the Superpowers collection for Claude Code — a set of skills designed to constrain the agent’s behavior in ways that improve reliability. This one is the most mechanical, but it’s also the one that, once active, most immediately changes the character of the output. Sessions feel different. Claims feel heavier. That’s the point.
The two failure modes the skill catches
The first failure mode is the obvious one: the agent never ran the command at all. It generated code that looks correct based on patterns in its training data, reported completion, and moved on. The test suite played no part in anything it actually checked. This happens more than it should. The agent isn’t lying exactly — it has strong prior expectations about whether code it just generated will pass tests, and those expectations are often right. But “often right” isn’t “verified”, and the gap between those two things is exactly where the embarrassing failures live.
The second failure mode is subtler and more dangerous: the agent ran the command but ignored output it didn’t recognize as a problem. A test suite runs, one test is marked skip or todo, the overall exit code is zero, and the agent reports “all tests pass.” From a strict reading, that’s true. From a practical reading, the test that would have caught the regression was not running.
A variant: the agent runs the type checker, gets an output block with 40 lines of diagnostic noise it has seen before (unused variable warnings, deprecated API hints), and filters those out as “not real errors” while missing the one type error buried on line 23 that is a real error. The exit code might not even reflect it if the config doesn’t treat that category as fatal.
The skill addresses both. It requires the command to run AND the output to be included with the claim. If the output shows anything other than a clean pass, that becomes part of the completion report rather than something to quietly move past.
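A minimal sketch of what catching both failure modes looks like mechanically, assuming a vitest/bun-style runner whose summary mentions skipped and todo tests. The script name and the skip heuristic are illustrative, not something the skill prescribes:

// verify-tests.ts: run the verification command, keep the full output,
// and refuse a "clean pass" verdict if anything was skipped, even on exit code 0.
import { spawnSync } from "node:child_process";

const cmd = process.argv.slice(2); // e.g. bun run test src/lib/affiliate.test.ts
if (cmd.length === 0) {
  console.error("usage: verify-tests.ts <command> [args...]");
  process.exit(2);
}
const result = spawnSync(cmd[0], cmd.slice(1), { encoding: "utf8" });
const output = `${result.stdout ?? ""}${result.stderr ?? ""}`;
console.log(output); // the output ships with the claim, always

// Rough heuristic: summary lines that mention skipped or todo tests.
const skipped = /\b(skipped|todo)\b/i.test(output);
if (result.status !== 0) {
  console.log("VERDICT: command failed, do not claim done");
} else if (skipped) {
  console.log("VERDICT: exit code 0 but skipped/todo tests in the output, not a clean pass");
} else {
  console.log("VERDICT: clean pass, the claim is supported");
}

The point is not this particular script; it's that the verdict is derived from the captured output, never from the exit code alone.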
What the skill actually requires
The mechanics are straightforward. Before any completion claim, the agent must:
- Run the relevant verification command
- Paste the actual output into the response
- Make the claim only if the output supports it
For a code change that touches a module with tests, that looks like:
$ bun run test src/lib/affiliate.test.ts
✓ src/lib/affiliate.test.ts (12 tests) 847ms
Test Files 1 passed (1)
Tests 12 passed (12)
Duration 1.02s
Then: “Tests pass — all 12 in the affiliate module.”
Not: “I’ve updated the affiliate logic. The tests should pass.”
The difference between those two things is the difference between a verified claim and a hope dressed up as a claim.
The same applies to linting, type checking, build steps — any gate that exists in the project’s pipeline. If bun run lint:content is a build gate, and the change touches content files, the lint output goes in the response before completion is declared.
There is a corollary about reproducing failures. When a task starts with “this function is broken”, the agent must demonstrate the broken behavior before claiming to have fixed it. “I’ve found and addressed the issue” is not verification. The original failure must appear in the response — the stack trace, the failing assertion, the wrong output — followed by the same check after the fix showing it no longer fails. Evidence, then the claim, in that order, every time.
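In test form, the reproduce-then-fix pattern looks something like this. The module, function, and failure are hypothetical (buildRedirectUrl is made up for illustration), and the runner is assumed to be bun:test; what matters is that this test is shown failing against the unfixed code before any fix is claimed:

// affiliate.redirect.test.ts: hypothetical reproduction of a reported bug.
// Step 1: run it against the unfixed code and paste the failing output.
// Step 2: apply the fix, run it again, and paste the passing output.
import { describe, expect, it } from "bun:test";
import { buildRedirectUrl } from "./affiliate";

describe("buildRedirectUrl", () => {
  it("preserves existing query params when appending the affiliate tag", () => {
    const url = buildRedirectUrl("https://example.com/item?ref=abc", "tag-123");
    expect(url).toContain("ref=abc"); // the reported failure: ref was being dropped
    expect(url).toContain("tag-123");
  });
});

A test that passes on its first run against the unfixed code has reproduced nothing, which is exactly the "fixed, sort of" trap discussed later.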
How this changes commit messages
The verification requirement has a downstream effect on commit messages that’s worth naming explicitly.
A commit message that says “fix race condition in connection pool” is an assertion. It implies the fix was verified: that the race condition was observable, that a change was made, and that the race condition is no longer observable. Without verification, that message is aspirational. The commit is a hypothesis, not a fix.
With the skill active, a commit message citing a fix should be traceable to verification output in the agent’s response immediately before the commit. If the output isn’t there, the message doesn’t get written. The agent instead writes: “update connection pool locking logic (unverified)” or holds off on committing until it can show the output.
The same logic applies to PR descriptions. A PR that says “fixes #342” implies the fix was confirmed against the reported behavior. An agent that hasn’t reproduced the original failure cannot make that claim. It can say “addresses the scenario described in #342” — weaker, but accurate.
This might sound pedantic. It’s not. The purpose of a commit message is to communicate what actually changed and what it does. When commit messages routinely overstate confidence, they stop being useful. Reviewers learn to distrust them. The git log becomes noise. Verification discipline is how you keep the log honest.
The related effect on code review: a PR where the description includes actual test output (“12 passed, 0 failed, coverage 84%”) is easier to review than one where the description says “added tests.” The reviewer can assess whether the coverage is adequate, whether the number of tests is plausible for the change, whether anything looks like it was skipped. Visible verification output is a form of communication, not just a compliance check.
An incident: lint passed, sort of
A concrete example of why the “show the output” rule matters.
A PR came in with a description that said the content lint step passed. The author had run bun run lint:content and gotten a clean exit. What they hadn’t noticed: the glob in the lint script at the time was src/content/**.mdx rather than the fully recursive src/content/**/*.mdx, and with the glob library in use it only walked one level deep. The new article was in src/content/guides/, which it covered. The fixtures in src/content/guides/subfolder/ were not walked.
The PR landed. The next full build ran astro build, which has its own content loading, and a fixture file in the subfolder had a slop phrase from an earlier draft that never got cleaned up. Build failed in CI.
If the lint output had been included in the PR description, the file count would have been visible: 47 files checked. The person reviewing the PR would have known how many content files existed and could have caught that the count was short. The output, not just the exit code, carries information.
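For comparison, a sketch of the kind of lint wrapper that puts the count in the output. It assumes Bun's Glob API, and the slop-phrase check is a stand-in for whatever the real lint:content script does:

// lint-content.ts: hypothetical wrapper where the number of files walked is part of the report.
import { Glob } from "bun";

const pattern = "src/content/**/*.mdx"; // the recursive form, not src/content/**.mdx
const files = [...new Glob(pattern).scanSync(".")];
let failures = 0;
for (const file of files) {
  const text = await Bun.file(file).text();
  if (/\b(delve|tapestry)\b/i.test(text)) { // stand-in for the real slop-phrase list
    console.error(`slop phrase in ${file}`);
    failures++;
  }
}
console.log(`${files.length} files checked, ${failures} failures`);
process.exit(failures > 0 ? 1 : 0);

A reviewer who knows roughly how many content files the repo has can spot immediately that a reported count is short.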
The skill enforces showing output precisely because the exit code is not the whole story.
That particular incident is recoverable — CI catches it, nobody ships broken content. But the same class of problem appears in less forgiving contexts: a security-relevant validation that the agent reports as working, a data migration that the agent reports as completed, a rate-limit fix that the agent reports as verified. In those contexts, “exit code zero” as the sole evidence of correctness is not an adequate standard. The output matters.
What changes in practice
When verification-before-completion is active, the agent’s working loop changes shape. Instead of:
write code → claim done
It becomes:
write code → run verification → show output → claim done (if output supports it)
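One way to read that loop, as a sketch rather than anything the skill prescribes: a completion claim is a value that cannot be constructed without the output attached. The command is the article's test invocation; the Claim shape is made up for illustration:

// claim.ts: a claim only exists once the verification output has been captured.
import { spawnSync } from "node:child_process";

type Claim = { ok: boolean; command: string; output: string };

function verify(command: string, args: string[]): Claim {
  const r = spawnSync(command, args, { encoding: "utf8" });
  return {
    ok: r.status === 0,
    command: [command, ...args].join(" "),
    output: `${r.stdout ?? ""}${r.stderr ?? ""}`,
  };
}

const claim = verify("bun", ["run", "test", "src/lib/affiliate.test.ts"]);
console.log(claim.output); // show the output first
console.log(claim.ok ? "Done; the output above supports it." : "Not done; the output above shows why.");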
For short tasks — a one-line fix with a narrow test — this adds maybe 15 seconds. For any task that touches a module with real test coverage, it adds however long the test suite takes to run for that module.
The overhead is real. The alternative is occasional “done” claims that aren’t, which cost far more time to unwind than the verification would have taken. The economics only look unfavorable if you assume the unverified claim was going to be right anyway. In practice, it often isn’t.
A secondary benefit: the verification loop surfaces problems early in the agent’s working context, while the code it just wrote is still fresh. An agent that discovers a test failure immediately after writing the change has full context to fix it. An agent that discovers the same failure three turns later — after the user ran the code themselves — has to reconstruct what it did and why.
A third benefit that’s easy to miss: verification output in the response creates a record. Six weeks from now, when someone asks whether that module had passing tests before the refactor, the answer is in the chat history. Not “I believe so” — actually in the history, with the test names and durations. This matters more on long-lived projects where the conversation log serves as a lightweight audit trail.
Pairing with the broader skill set
verification-before-completion pairs well with test-driven-development (write the failing test first, then the code, then verify the test passes) and with finishing-a-development-branch (pre-commit checklist before merge). In a TDD workflow, verification is already structural: the test exists before the code, so running it to confirm it passes is the natural final step. The skill makes that step mandatory rather than assumed.
The TDD pairing also handles the “fixed, sort of” variant. An agent that writes a test, makes it pass, and shows the output has done real verification. An agent that writes a test, notices that the existing code already passes it, and concludes “the bug must not have existed” has verified nothing useful. The original failure has to be reproduced. If you can’t reproduce it, you can’t confirm it’s gone.
With finishing-a-development-branch, the verification becomes the branch sign-off checklist: tests run, output shown, lint clean, build passes, anything else the project’s pipeline requires. The commit that lands the branch should have those outputs visible in the preceding agent turns. A reviewer looking at the PR should be able to trace from “tests pass” in the description to the actual test output in the session log.
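A branch sign-off in that spirit might look like the following, assuming the gates are the test suite, the content lint, and the build (the exact gate list is illustrative; a real one comes from the project's pipeline):

// sign-off.ts: hypothetical pre-merge checklist where every gate runs and every output is kept.
import { spawnSync } from "node:child_process";

const gates: [string, string[]][] = [
  ["bun", ["run", "test"]],
  ["bun", ["run", "lint:content"]],
  ["bun", ["run", "build"]],
];

let allPassed = true;
for (const [cmd, args] of gates) {
  const r = spawnSync(cmd, args, { encoding: "utf8" });
  console.log(`$ ${[cmd, ...args].join(" ")}`);
  console.log(`${r.stdout ?? ""}${r.stderr ?? ""}`); // the raw output goes into the sign-off, verbatim
  if (r.status !== 0) allPassed = false;
}
console.log(allPassed ? "All gates passed. The branch can be finished." : "A gate failed. The branch is not done.");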
What “obviously correct” actually costs
The resistance to verification usually comes from the “obviously correct” case. The fix is a one-line change; the logic is simple; running the whole test suite for this feels like overhead. This framing is backwards.
The cases that feel obviously correct are the ones where verification provides the most signal per unit of time spent. If the fix is trivially right, running the tests costs 10 seconds and returns confirming output. If the fix turns out to be subtly wrong — which happens more than the “obviously correct” framing suggests — verification catches it immediately rather than at user-run time or CI time.
The cases that feel complex are the ones where the agent is already uncertain and was going to check anyway, so the mandatory verification loop is doing the least additional work there. It’s the confident completions that benefit most from checking.
The cases where verification discipline matters most are exactly the cases where it’s most tempting to skip it: the “obvious” fix, the “small” refactor, the change that “clearly” couldn’t have broken anything. Those are the changes that tend to fail silently, because their failure modes are precisely the ones no one thought to check.
Evidence before assertions. Show the output. Then claim done.
The agents that are worth trusting are the ones that have demonstrated they check their own work. That demonstration happens through verification output, one session at a time. It’s the cheapest signal of reliability an agent can produce.