Verification Debt in AI Coding
Last updated: 2026-07-02 · 6 min read
Verification debt is the growing gap between how fast AI coding tools generate code and how reliably a team can verify that code — review it, validate it against intent, test it, and stand behind it — before it is merged. Like technical debt, it compounds: every unverified change becomes a foundation someone else builds on.
Why the term is suddenly everywhere
For most teams, 2025 was the year AI-assisted coding stopped being an experiment. Generation became cheap; verification did not. The term “verification debt” spread through the developer community and reached the mainstream when AWS CTO Werner Vogels used it at AWS re:Invent in December 2025, as reported by ITPro. Around the same time, Sonar’s State of Code survey of more than 1,100 developers put numbers on the gap:
- 96% of developers say they do not fully trust AI-generated code to be functionally correct. (Sonar, State of Code Survey)
- 48% say they always check their AI-assisted code before committing it: just under half. (Sonar, State of Code Survey)
- 19% longer: how much more time experienced open-source developers took with AI assistance in METR's randomized trial, while believing they were faster. (METR, RCT, 2025)
The same Sonar research found that 38% of developers say reviewing AI-generated code takes more effort than reviewing a colleague’s code, and 53% have seen AI produce code that looks correct but isn’t reliable. That combination (near-universal distrust, partial verification, rising review effort) is verification debt accumulating in plain sight.
What verification debt is — and what it isn't
Verification debt is a flow imbalance: code enters the codebase faster than the team’s capacity to verify it. It is not a statement about AI code quality. Even if generated code were right nine times out of ten, a team that cannot tell which nine still carries the debt for all ten.
It differs from technical debt in how it announces itself. Technical debt produces friction you can feel — slow builds, brittle modules, dreaded files. Verification debt produces the opposite: everything looks done. As developer and educator Kevin Browne puts it, it is the AI era’s technical debt — except it breeds false confidence instead of visible friction.
It is also related to, but distinct from, comprehension debt, the O’Reilly Radar term for code nobody on the team fully understands anymore. Comprehension debt asks “do we understand this code?”; verification debt asks “did anyone actually check this change against what we meant to build?” A team can understand its codebase and still merge unverified changes all day.
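The flow-imbalance framing above can be made concrete with a small model. This is only an illustrative sketch: the `Week` record, its field names, and the sample numbers are assumptions invented for this example, and it assumes every generated line is merged whether or not it was verified.

```python
from dataclasses import dataclass


@dataclass
class Week:
    generated_lines: int  # changed lines merged this week (hypothetical input)
    verify_capacity: int  # lines the team can genuinely verify this week (hypothetical input)


def verification_backlog(weeks: list[Week]) -> list[int]:
    """Return the running count of merged-but-unverified lines after each week."""
    backlog = 0
    history = []
    for week in weeks:
        # Capacity is spent on this week's inflow plus whatever is already owed;
        # the backlog never drops below zero.
        backlog = max(0, backlog + week.generated_lines - week.verify_capacity)
        history.append(backlog)
    return history


# Made-up numbers: generation ramps up while verification capacity barely moves.
weeks = [Week(2000, 1500), Week(3000, 1500), Week(4000, 1600), Week(1000, 1600)]
print(verification_backlog(weeks))  # [500, 2000, 4400, 3800]
```

In this toy run, one quiet week at the end recovers only a fraction of what three busy weeks added, which is the compounding behavior the definition describes.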
Where verification debt comes from
Five mechanisms do most of the damage:
- Generation outruns review capacity. One developer with an AI tool can produce more changed lines per day than a senior engineer can critically audit.
- Large diffs invite skimming. AI-generated pull requests tend to be big and plausible-looking, which is exactly the kind of change human reviewers skim rather than scrutinize.
- The generator grades its own homework. When the same model writes the code, writes the tests, and summarizes the change, there is no independent check anywhere in the loop.
- Intent is missing at review time. Reviewers see what changed but not the task boundaries the change was supposed to respect — so “is this even the right change?” goes unasked.
- Verification evidence is nobody’s artifact. What was tested, what was skipped, and what remains uncertain usually lives in a chat scrollback, if anywhere — invisible to the reviewer and gone in a week.
How to measure it in your team
There is no standard metric yet — the honest way to measure verification debt today is directional, from signals you already have:
- Generation-to-verification ratio: changed lines merged per week vs. reviewer time actually spent. If generation doubled and review time didn’t, the difference is debt.
- Review depth trend: substantive review comments per 100 changed lines, over time. A falling curve while AI-assisted volume rises means reviews are thinning, not improving.
- Unverified-merge rate: share of AI-assisted changes merged without any human-verified test evidence attached.
- “Looked right, wasn’t” incidents: defects traced back to changes that passed review. Sonar found 53% of developers have seen exactly this failure mode with AI code.
- Rework rate: how often AI-assisted changes are reverted, hot-fixed, or rewritten within 30 days.
None of these requires new tooling to start — a spreadsheet and one honest retro per month will surface the trend.
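As a rough sketch of how those signals could be computed from data a team already has, the following assumes each merged change has been exported (for example from a spreadsheet) into a record like `MergedChange`. The record, its field names, and the sample rows are hypothetical. The "looked right, wasn't" count is left out because it depends on tracing incidents by hand.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class MergedChange:  # hypothetical export format, one row per merged change
    merged_on: date
    changed_lines: int
    ai_assisted: bool
    review_minutes: float              # reviewer time actually spent
    substantive_comments: int          # comments that asked for a real change
    has_human_verified_evidence: bool
    reworked_on: Optional[date] = None  # revert, hotfix, or rewrite date, if any


def debt_signals(changes: list[MergedChange], rework_window_days: int = 30) -> dict[str, float]:
    """Compute directional verification-debt signals for one reporting period."""
    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else 0.0

    total_lines = sum(c.changed_lines for c in changes)
    ai_changes = [c for c in changes if c.ai_assisted]
    window = timedelta(days=rework_window_days)
    unverified = [c for c in ai_changes if not c.has_human_verified_evidence]
    reworked = [
        c for c in ai_changes
        if c.reworked_on is not None and c.reworked_on - c.merged_on <= window
    ]
    return {
        # Generation-to-verification ratio: merged lines per minute of review.
        "lines_per_review_minute": ratio(total_lines, sum(c.review_minutes for c in changes)),
        # Review depth: substantive comments per 100 changed lines.
        "comments_per_100_lines": ratio(100 * sum(c.substantive_comments for c in changes), total_lines),
        "unverified_merge_rate": ratio(len(unverified), len(ai_changes)),
        "rework_rate": ratio(len(reworked), len(ai_changes)),
    }


# Made-up sample period with four merged changes.
changes = [
    MergedChange(date(2026, 1, 5), 400, True, 20, 2, False, date(2026, 1, 12)),
    MergedChange(date(2026, 1, 6), 100, True, 30, 3, True),
    MergedChange(date(2026, 1, 7), 300, False, 50, 3, True),
    MergedChange(date(2026, 1, 8), 200, True, 0, 0, False, date(2026, 3, 1)),
]
signals = debt_signals(changes)
for name, value in signals.items():
    print(f"{name}: {value:.2f}")
```

A single period's numbers say little on their own. The point is to recompute them each month and watch the direction of the trend.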
How teams reduce verification debt
The teams that handle this well change what arrives at review rather than reviewing harder:
- Define the task before the run — scope, affected files, and a validation plan written down before the AI generates anything, so there is something concrete to verify against.
- Keep diffs small — a reviewable unit of change beats an impressive one.
- Separate generation from verification — the model that wrote the change should not be the only thing that checked it.
- Require evidence per change — what was tested, what was not, what is still uncertain, attached to the change itself instead of buried in a chat log.
- Keep a human approval gate — no auto-commit; a person accepts the change, with the evidence in front of them.
- Stay local-first where code is sensitive — verification should not require uploading source code to another cloud service.
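Several of these practices can be expressed as a check that runs before a reviewer is asked to approve. The sketch below is a minimal illustration and not any particular tool's format: `TaskDefinition`, `EvidenceReport`, both function names, and the 400-line threshold are assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskDefinition:  # hypothetical: written down before the AI run
    goal: str
    allowed_files: set[str]
    validation_plan: list[str]


@dataclass
class EvidenceReport:  # hypothetical: attached to the change itself
    changed_files: set[str]
    changed_lines: int
    checks_run: list[str]  # run independently of the generating model
    checks_skipped: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)


def review_blockers(task: TaskDefinition, evidence: EvidenceReport, max_lines: int = 400) -> list[str]:
    """List reasons the change is not yet ready for a human to approve."""
    blockers = []
    out_of_scope = sorted(evidence.changed_files - task.allowed_files)
    if out_of_scope:
        blockers.append(f"files outside agreed scope: {', '.join(out_of_scope)}")
    missing = [check for check in task.validation_plan if check not in evidence.checks_run]
    if missing:
        blockers.append(f"planned checks not run: {', '.join(missing)}")
    if evidence.changed_lines > max_lines:  # arbitrary illustrative threshold
        blockers.append(f"diff too large to review carefully: {evidence.changed_lines} lines")
    return blockers


def may_merge(task: TaskDefinition, evidence: EvidenceReport, human_approved: bool) -> bool:
    # No auto-commit: an empty blocker list is necessary but never sufficient.
    return human_approved and not review_blockers(task, evidence)


task = TaskDefinition(
    goal="Add retry to the payment client",
    allowed_files={"payments/client.py", "tests/test_client.py"},
    validation_plan=["unit tests", "manual timeout check"],
)
evidence = EvidenceReport(
    changed_files={"payments/client.py", "payments/config.py"},
    changed_lines=120,
    checks_run=["unit tests"],
    open_questions=["Is 3 retries the right default?"],
)
for blocker in review_blockers(task, evidence):
    print(blocker)
print(may_merge(task, evidence, human_approved=True))  # False
```

The check never approves anything on its own. It only decides whether the change is in a state where asking a person to approve it makes sense.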
Where Reality Graph fits
Reality Graph is a local-first verification layer that works beside the AI coding tools a team already uses — Claude Code, Cursor, GitHub Copilot and similar. It applies the practices above as a workflow: task boundaries and context before the run, visible validation and an evidence report after it, with a human approval gate in between. It is currently in private beta; early access is open for a small group of teams.
What it does
- Structures the task, scope, and validation plan before an AI coding run
- Keeps source code in your environment — local-first by design
- Produces a reviewable evidence report per run: intent, changes, validation, open questions
- Keeps a human approval gate — advisory by default, no auto-commit
What it does not do
- Replace Claude Code, Cursor, or Copilot — it works beside them
- Write or commit code on its own
- Claim benchmark numbers or guaranteed savings — no public claims without linked evidence
- Replace your reviewers, tests, or CI — it feeds them better input
FAQ
- What is verification debt in AI coding?
  Verification debt is the growing gap between how fast AI coding tools generate code and how reliably a team can verify that code (review it, validate it against the original intent, test it, and stand behind it) before it is merged. Every change that ships without that verification adds to the debt.
- How is verification debt different from technical debt?
  Technical debt usually announces itself: slow builds, tangled modules, painful changes. Verification debt is quieter: the code looks finished and often works, so it breeds false confidence. The cost surfaces later, when unverified changes become the foundation for further changes and nobody can say with confidence what was actually checked.
- Why doesn't human code review scale with AI coding?
  Because generation got faster and review did not. A single developer with an AI coding tool can produce more changed lines per day than a senior reviewer can critically audit, and large AI-generated pull requests invite skimming instead of scrutiny. Adding more reviewers helps less than changing what arrives at review: smaller diffs, explicit intent, and evidence of what was already validated.
- How do you verify AI-generated code before merge?
  Practically: define the task and its boundaries before the run, keep the diff small, check the change against the stated intent (not just for correctness in isolation), run validation the generating model did not write itself where feasible, and require the change to arrive with evidence of what was tested, what was not, and what remains uncertain. A human stays the final gate.
- Does verification debt mean teams should use less AI?
  Not necessarily. It means verification capacity has to grow with generation speed. Teams that pair AI coding tools with explicit task boundaries, independent validation, and evidence per change can generate quickly and still know what they shipped. The debt comes from skipping verification, not from using AI.
- Who coined the term verification debt?
  The idea grew out of the developer community during 2025 as AI-assisted coding became mainstream. It reached a wide audience when AWS CTO Werner Vogels used the term at AWS re:Invent in December 2025, and research such as Sonar's State of Code survey quantified the gap it describes.
Sources
- Sonar — State of Code Developer Survey: the verification gap in AI coding (2026)
- ITPro — Nearly half of developers don't check AI-generated code (verification debt, Vogels at re:Invent)
- METR — Measuring the impact of early-2025 AI on experienced open-source developer productivity (RCT)
- Kevin Browne — Verification debt is the AI era's technical debt
- LeadDev — You can't verify all the AI-generated code
- O'Reilly Radar — Comprehension debt: the hidden cost of AI-generated code