Method

The Spec-vs-Implementation Check

Last updated: 2026-07-02 · 5 min read

A spec-vs-implementation check verifies AI-generated code against a written statement of intent – goal, boundaries, acceptance criteria – instead of against the reviewer’s memory of the prompt. It turns “looks right” into a yes/no comparison and gives review an external reference the generating model cannot influence.

Why 'looks right' stopped being enough

The numbers describe a strange equilibrium. In Sonar’s 2026 State of Code survey, 96% of developers say they do not fully trust that AI-generated code is functionally correct – yet only 48% always check it before committing, and 61% report AI code that looks correct but is not reliable. AI already accounts for roughly 42% of committed code in the same survey. Distrust without a method becomes resignation.

The failure mode is specific. AI code rarely fails loudly; it fails in the gap between what the requirement meant and what the model understood it to mean. Reviewing the diff cannot catch that gap, because the diff only shows what was built – not what was asked. The missing artifact is the written intent, and every merge without it adds to the debt.

The circularity problem: a model checking itself

The obvious fix – let the AI test and review its own output – has a structural flaw: both sides of the comparison come from the same assumptions. Tests written by the generating model verify what the model assumed the function should do, which can differ subtly from what it should actually do. Recent research on AI-assisted review draws the same conclusion: a specification is the quality gate precisely because it introduces a reference that is independent of the model’s training distribution – without it, review checks code against itself.
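A minimal sketch of that circularity, borrowing the Retry-After scenario used later in this article. Everything in it is a hypothetical illustration, not code from a real run: the function `retry_after_seconds`, the 60-second window, and both checks are assumptions. The generated function returns the full window length instead of the time left in the window. A test built on the same assumption passes, while a check derived from written intent fails.

```python
WINDOW_SECONDS = 60  # hypothetical rate-limit window


def retry_after_seconds(window_start: float, now: float) -> int:
    """Hypothetical generated code: returns the window length,
    not the time left in the window (the misunderstanding)."""
    return WINDOW_SECONDS


def model_written_test() -> bool:
    """Test built on the same assumption as the code: it only checks
    that the value equals the window length."""
    return retry_after_seconds(window_start=0.0, now=45.0) == WINDOW_SECONDS


def spec_derived_check() -> bool:
    """Check derived from the written criterion: 45 s into a 60 s
    window, the client should be told to wait 15 s."""
    return retry_after_seconds(window_start=0.0, now=45.0) == 15


if __name__ == "__main__":
    print("model-written test passes:", model_written_test())
    print("spec-derived check passes:", spec_derived_check())
```

The code and its test agree with each other, and both disagree with the intent. Only the check that originates outside the generator exposes the difference.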

How large the self-assessment gap can get is documented: one team measured 449 AI code reviews and found the system rated itself 98.6% valid while independent validation put it at 69% – a gap of almost thirty percentage points between confidence and reality. The conclusion is not “never use AI review”; it is: separate the reference from the generator.

The check in five steps

Tool-agnostic – it works with Claude Code, Cursor, Copilot, or any agent:

1. Write the Soll (target state) before the run. One goal sentence, the boundaries (files and behavior that must not change), and 3–7 acceptance criteria. Five minutes, plain language.
2. Make every criterion checkable. “Returns a correct Retry-After header on 429 responses” is checkable; “improves rate limiting” is not. If you cannot phrase a yes/no question, sharpen the criterion.
3. Let the tool run unchanged. The method adds nothing between you and the agent – same prompts, same speed.
4. Compare the Ist (actual result) against the Soll. Walk the diff criterion by criterion. Off-scope changes are findings even when they look like improvements – scope creep is where silent regressions hide.
5. Record the outcome. Which criteria passed, what was skipped, what stays open – a few lines stored with the change, so the next person is not re-deriving your check. That record is what an evidence report formalizes.
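The boundary part of the comparison step can be partly mechanized. The following is a minimal sketch under stated assumptions, not a prescribed implementation: the `Soll` class, its field names, and the file paths are hypothetical, and boundaries are assumed to be expressed as glob patterns. It flags changed files that fall outside the written boundary and checks that the number of criteria stays within the 3–7 range suggested above. It requires Python 3.9 or later.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class Soll:
    """Written intent, captured before the run (hypothetical structure)."""
    goal: str
    allowed_paths: list[str]  # boundary: glob patterns the change may touch
    criteria: list[str] = field(default_factory=list)


def off_scope_files(soll: Soll, changed_files: list[str]) -> list[str]:
    """Return the changed files that match none of the allowed patterns."""
    return [
        path for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in soll.allowed_paths)
    ]


def criteria_count_ok(soll: Soll) -> bool:
    """True if the spec has between three and seven acceptance criteria."""
    return 3 <= len(soll.criteria) <= 7


soll = Soll(
    goal="Return a correct Retry-After header on 429 responses",
    allowed_paths=["api/middleware/*"],
    criteria=[
        "header present on every 429",
        "value equals the seconds left in the window",
        "no change to other response paths",
        "unit tests pass",
    ],
)

# Hypothetical list of files touched by the run.
changed = ["api/middleware/rate_limit.py", "api/handlers/forbidden.py"]

print(off_scope_files(soll, changed))  # ['api/handlers/forbidden.py']
print(criteria_count_ok(soll))         # True
```

In practice the list of changed files could come from a command such as `git diff --name-only`. A path check like this only covers file boundaries; behavioral boundaries ("rate-limit logic untouched") still need a human walk through the diff.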

The whole artifact is small enough to read at a glance:

soll-ist-check.md

Example – not real run data

SOLL (written before the run)
Goal:      Return a correct Retry-After header on 429 responses
Boundary:  api/middleware/* only · rate-limit logic untouched
Criteria:  [1] header present on every 429   [2] value = seconds left in window
           [3] no change to 2xx/other 4xx paths   [4] unit tests pass

IST (checked after the run)
[1] PASS   [2] PASS   [3] FAIL – 403 handler was also edited (off-scope)
[4] PASS (61 green, 3 new – written before the run)

Verdict:   REJECTED → re-run with boundary reminder · re-check [3]
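The record above can also be produced from structured results, so the verdict follows from the criteria rather than from impression. This is a sketch only: `CriterionResult`, `verdict`, and `render_record` are hypothetical names, and the `ACCEPTED` label is an assumption, since the example shows only a `REJECTED` verdict. The rule sketched here is the strictest one: a single failed criterion rejects the change.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    number: int
    passed: bool
    note: str = ""


def verdict(results: list[CriterionResult]) -> str:
    """ACCEPTED only if every criterion passed, otherwise REJECTED."""
    return "ACCEPTED" if all(r.passed for r in results) else "REJECTED"


def render_record(results: list[CriterionResult]) -> str:
    """Render a short plain-text record to store with the change."""
    lines = ["IST (checked after the run)"]
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        suffix = f" - {r.note}" if r.note else ""
        lines.append(f"[{r.number}] {status}{suffix}")
    lines.append(f"Verdict: {verdict(results)}")
    return "\n".join(lines)


# Illustrative data mirroring the example above, not real run data.
results = [
    CriterionResult(1, True),
    CriterionResult(2, True),
    CriterionResult(3, False, "403 handler was also edited (off-scope)"),
    CriterionResult(4, True),
]

print(render_record(results))
```

Storing the rendered text next to the change (for example in the pull request description) is one way to keep the record from evaporating.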

Limits and typical mistakes

Spec theater. A spec nobody compares against is ritual, not verification. The check is the point – the document is just its input.
Over-specification. Describing the implementation instead of the intent produces specs that age badly and get skipped. Specify what must be true, not how to build it.
It cannot catch what the spec missed. The method verifies the stated intent. Requirements you did not think of stay unverified – this is why research treats intent formalization as a grand challenge, not a solved problem.
Skipping the record. An unrecorded check evaporates. The team cannot tell a verified merge from a lucky one a month later.

Where Reality Graph fits

Everything above works manually with a text file and discipline. Reality Graph systematizes exactly this method: the Soll becomes a structured task with boundaries, the Ist is checked against it after the run, and the outcome lands in a reviewable report – as part of the broader verification loop.

The method gives you

An external reference the generating model cannot influence
A yes/no comparison instead of 'looks right'
Scope-creep detection as a side effect of boundaries
A recorded outcome the next reviewer can build on

It does not

Catch requirements the spec never stated
Replace tests, types, or security review
Require formal methods – plain language works
Depend on any specific AI coding tool

FAQ

How do you verify AI-generated code systematically?

Write down what the change is supposed to do before the run – goal, boundaries, acceptance criteria. After the run, compare the diff against that written intent instead of against your memory of the prompt, run validation the model did not author, and record the result. The written spec is what makes the check systematic rather than a gut feeling.

Isn't that just testing?

Tests are one instrument inside the check, not the check itself. A test suite verifies behavior the test author thought of – and when the same model writes code and tests, both share the same assumptions. The spec-vs-implementation check adds the missing reference point: an intent statement that exists independently of the model, against which code, tests, and scope are all compared.

Do I need formal methods for this?

No. Formal verification is the strongest version of the idea and an active research field, but the practical method works with plain language: a goal sentence, hard boundaries, and three to seven checkable acceptance criteria. Teams get most of the value from writing intent down at all – rigor can grow later where the risk justifies it.

How detailed does the spec need to be?

Detailed enough that a colleague could decide 'done or not done' without asking you. In practice that is a few lines: one goal, the boundaries (files and behavior that must not change), and acceptance criteria phrased so each one is a yes/no check. Specs that try to describe every implementation detail age badly and stop being read.

What if the code already exists and there was never a spec?

Write the spec at review time instead of before the run: state what you believe the change is supposed to do, then check the diff against that statement. It feels backwards, but it still breaks the circularity – you are no longer asking the code to explain itself. For the next run, write the intent first.

Does a spec-vs-implementation check slow teams down?

It moves minutes to the front of the task. Writing the intent takes five minutes; reviewing a diff against written intent is usually faster than reconstructing what the change was supposed to do from the diff itself. What actually slows teams down is the rework loop when 'looks right' code turns out wrong – 61% of developers report exactly that pattern with AI code.
