Accept or reject? Five heuristics that beat 'looks right'
Published 2026-05-11 by Owner
The hardest part of AI-assisted coding isn’t writing prompts. It’s deciding whether to accept what comes back.
The default review heuristic is “looks right.” This fails at a predictable rate. The model produces plausible code. The code passes a visual scan. The code ships. The bug surfaces two days later in production when someone passes an empty array to the function that wasn’t guarded for it.
The failure isn’t the AI — the AI generated the code it was going to generate. The failure is the review. A vague review criterion (“looks right”) is not a heuristic; it’s the absence of one. Without a specific thing to check, the eye slides over subtle problems. The structure looks familiar. The names look reasonable. The logic looks sound, at first glance.
Five concrete checks replace “looks right” with something that actually catches failures.
The five checks
1. Read the test — does it actually test the behavior?
When an AI adds a test alongside new code, read the assertion, not just the test name.
A common failure: the test calls the function with valid input and asserts the output is truthy. That passes for any implementation that returns anything. The test name says “should process items correctly” and it does pass — it just doesn’t falsify anything useful.
What you want to see: an assertion against a specific value, not just a non-null result.
// weak — passes even if the function is broken in subtle ways
expect(result).toBeTruthy();
// strong — fails if the output is wrong in any specific way
expect(result).toEqual([{ id: 1, label: 'first' }]);
If the model wrote the first kind of test, either strengthen it yourself or ask the model to write a more specific assertion. Weak tests are worse than no tests — they create false confidence.
2. Check the edge case — what if input is empty, null, or huge?
Models write for the happy path. The happy path is the version of the function where inputs are well-formed, non-empty, and roughly the expected size. Edge cases require more deliberate thinking.
For any function that accepts a collection, ask: what happens if it’s empty? For a function that accepts a string, what happens if it’s null or an empty string? For a function that fans out work, what happens if the input has ten thousand items?
You don’t need to test all of these. You need to read the code and verify the author considered them. Signs that edge cases were not considered:
- .length accessed without a null check
- Array methods like .map() chained on a value that could be undefined
- No limit on iteration over a caller-supplied collection
A useful shortcut: after reading the function, ask “what’s the cheapest input that would break this?” If the answer comes to you quickly and the code doesn’t handle it, reject. If you have to think hard to find a breaking case and it would require unusual input, the edge-case coverage is probably fine.
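For a concrete sense of what that looks like, here is a hypothetical happy-path function where the cheapest breaking input is an empty array. The Item shape and the names are made up for illustration:
interface Item { price: number }
function averagePrice(items: Item[]): number {
  // .reduce() with no initial value throws a TypeError on an empty array,
  // and nothing guards against items being null or undefined
  const total = items.map((item) => item.price).reduce((a, b) => a + b);
  return total / items.length;
}
An empty array throws from the bare .reduce(); a null or undefined input throws from .map(); neither case is handled. That is the signal to reject, or to send it back with a note asking for a guard.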
Spot-checking one edge case — whichever one would hurt most in production — is usually enough to know whether the model thought about this category at all.
3. Watch for mocks — did the AI mock something that should be real?
When the model adds tests, it often adds mocks to keep the test self-contained. That’s fine for external APIs. It’s a problem when the mock substitutes for something that should be exercised.
The tell: the mock returns a hardcoded value, and the test asserts behavior based on that hardcoded value. The actual logic being tested is zero lines deep. If you deleted the real implementation and replaced it with return hardcodedValue, the test would still pass.
This matters most for integration boundaries — functions that combine multiple pieces of logic, database queries, file operations. Mocking those in a unit test is reasonable. Mocking the thing the test is supposed to be testing is not.
Read what the mock returns. Ask whether a test that only sees the mock’s response is actually testing your code.
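As a jest-style sketch of the pattern, with a made-up module and function, this is what the tell looks like when the mock replaces the very thing the test claims to cover:
import * as pricing from './pricing';
test('applies the discount', () => {
  // the function under test is replaced by a hardcoded stub
  jest.spyOn(pricing, 'applyDiscount').mockReturnValue(90);
  // the assertion only ever sees the stub's value; the real logic runs zero lines
  expect(pricing.applyDiscount(100, 0.1)).toBe(90);
});
Delete the real applyDiscount entirely and this test still passes, which is exactly the property a test should not have.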
4. Smell for boilerplate — excess error handling, redundant comments
The model writes confident-looking code. One form this takes is defensive boilerplate: try/catch blocks around code with no clear error contract, console.log statements for debugging, comments that restate what the code already says.
// calculates the total — redundant, the function name already says this
function calculateTotal(items: Item[]): number {
  try {
    // iterate over each item
    return items.reduce((sum, item) => {
      return sum + item.price;
    }, 0);
  } catch (error) {
    console.error('Error calculating total:', error);
    return 0;
  }
}
The try/catch here silently swallows a crash and returns 0. In real code, a crash in .reduce() is almost certainly a bug — swallowing it means the caller gets a wrong answer instead of an error. The comment is noise. If this ships as-is, you own the maintenance of both.
Strip boilerplate before accepting. The accepted version of that function should be three lines.
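Stripped, and assuming the same Item type, the accepted version can read:
function calculateTotal(items: Item[]): number {
  return items.reduce((sum, item) => sum + item.price, 0);
}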
5. Confirm with a runtime — run it, don’t just read it
Static reading catches many issues. It doesn’t catch timing bugs, environment assumptions, or outputs that are subtly wrong rather than structurally wrong.
Before accepting any non-trivial suggestion, run it. This means:
- Execute the test suite for the affected module
- Run the function manually in a REPL or test file with realistic input
- For UI changes, load the actual page in the browser
“Runs without error” is a low bar. “Produces the expected output for real input” is the actual bar.
The model is confident that its code works. That confidence isn’t evidence. The runtime is evidence.
One quick pattern: keep a scratch file open alongside the AI tool. When a suggestion comes in for a pure function, paste it into the scratch file with three or four representative inputs and run it before accepting. This adds maybe twenty seconds and catches output-is-wrong failures that static review rarely catches.
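A minimal sketch of that scratch file, assuming the calculateTotal example from earlier and a made-up import path; run it with ts-node or tsx, or paste it into a throwaway test:
// scratch.ts: throwaway harness, never committed
import { calculateTotal } from './totals';
console.log(calculateTotal([]));                                             // expect 0
console.log(calculateTotal([{ price: 10 }]));                                // expect 10
console.log(calculateTotal([{ price: 10 }, { price: 2.5 }, { price: 0 }]));  // expect 12.5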
A false accept that cost real time
Last month, an AI-generated function parsed a list of configuration objects and merged them. The output looked correct on visual review. The tests passed. The code went in.
Three days later, a user reported that their last configuration was being silently dropped. The function iterated with an off-by-one — it was processing items.length - 1 items, not items.length. The final object was never included.
The test had used a single-item list, a case where the off-by-one happens not to change the result. The edge case check — “what if there’s more than one?” — would have caught it instantly. Running the function against a three-item fixture before accepting would have caught it in thirty seconds.
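The kind of off-by-one described looks roughly like this; the names and types are illustrative, not the original code:
type Config = Record<string, unknown>;
function mergeConfigs(configs: Config[]): Config {
  let merged: Config = { ...configs[0] };          // first config as the base
  for (let i = 1; i < configs.length - 1; i++) {   // bug: should be i < configs.length
    merged = { ...merged, ...configs[i] };
  }
  return merged;                                   // the last config is silently dropped
}
With a single config the loop never runs and the base is the whole answer; with three, the third is never merged.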
The fix took ten minutes. Finding the cause took an hour. The heuristic that would have prevented it takes thirty seconds per suggestion.
The over-rejection trap
There’s a failure mode on the other side. Some developers who know AI makes mistakes start rejecting too much — spending fifteen minutes manually reviewing a twelve-line utility function that the model almost certainly got right.
Over-rejection has real costs. It slows the workflow to the point where the AI provides little advantage. It also causes its own quality problem: decisions made slowly under fatigue are no more reliable than fast decisions made with a checklist.
The five heuristics are a triage system, not a full audit. The goal is to apply each check in under thirty seconds and move on. If something fails a check, investigate — either fix it yourself or send the suggestion back with a specific note (“this doesn’t handle the empty case”). If everything passes, accept and ship.
Spending fifteen minutes auditing a ten-line helper because “AI might be wrong” is a poor use of the time AI is supposed to be saving. Use the checklist. Apply it consistently. Move.
Calibrate suspicion to evidence, not to anxiety. Blanket suspicion isn’t rigor — it’s just slower.
Calibration over a week
The ratio of accepted to rejected suggestions is a signal. Track it loosely for a week and look at what you rejected.
If rejections cluster around one category — say, mocks every time, or missing edge cases every time — that’s a pattern in either how the model is prompted or how the model behaves for your type of work. Adjusting the prompt for that specific gap (adding “handle the empty-array case” or “don’t mock the class under test”) is faster than applying extra vigilance manually every time.
If rejections are scattered with no pattern, the model is performing reasonably well and the five checks are catching genuine outliers. That’s the steady state to aim for.
The over-rejector will find they reject a lot, often for vague reasons, and their ratio doesn’t improve with better prompts. The useful feedback loop requires specificity: what check failed, not just “something felt off.”
Trust speed, verify before merging
The model is faster than a human at first-draft code. That’s its value. The judgment about whether the first draft is correct is not something to outsource.
The division of labor that works: let the AI write fast. Apply the five checks before merging. Read the test. Pick one edge case. Verify the mocks make sense. Strip the boilerplate. Run it.
That’s the whole review for most suggestions — five minutes, not fifteen. The goal isn’t to be suspicious of AI output. It’s to be specific about what you’re checking and efficient about checking it.