What is verification debt?

Verification debt is the growing gap between how fast AI tools generate code and how reliably a team verifies that code before merge. It accumulates silently while throughput metrics look excellent, and surfaces later as rework, incidents and lost trust in the codebase.

How big is the verification gap, in numbers?

Sonar's 2026 State of Code survey measured it directly: 96% of developers distrust AI-generated code, but only 48% consistently verify it. The gap is behavioral rather than technical - the checks exist but get skipped under speed pressure.

Why is code review suddenly the bottleneck?

Because generation got cheap and reading did not. Telemetry across high-AI teams shows nearly twice as many merged pull requests, with review time per PR up 91% - the delivery constraint moved from writing code to verifying it, and review processes built for human pace absorb the shock.

Is AI-generated code actually worse than human code?

It fails differently rather than being uniformly worse: hallucinated APIs, silent edge-case errors, scope creep, self-confirming tests and plausible-but-wrong logic are its characteristic failure classes. Security testing puts numbers on it - roughly 45% of AI-generated samples failed security tests in Veracode's 2025 analysis - and classic review misses several of these classes structurally.

How do you verify AI-generated code systematically?

With a loop around every run: a written task with goal, boundaries and acceptance criteria before generation; a comparison of the produced change against that task afterwards; validation the model did not author (tests, types, build); and a human decision before anything reaches a shared branch. The agent's own summary is a starting point, never the verification.

What is the difference between code review and verification?

Review judges the quality of a diff - bugs, style, security patterns - against general standards. Verification checks a change against its written task: scope, criteria, intent. A change can be flawless as code and still do the wrong thing; only verification catches that, because the reference lives outside the diff.
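
The scope half of that check can be sketched mechanically. This is a minimal illustration, assuming the task lists allowed path patterns and the changed-file list comes from something like `git diff --name-only`; the function name, patterns and paths are hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files, allowed_patterns):
    """Return the changed files that match none of the task's allowed patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical task boundary and change set, for illustration only.
allowed = ["src/parser/*", "tests/parser/*"]
changed = [
    "src/parser/tokens.py",
    "tests/parser/test_tokens.py",
    "src/billing/invoice.py",
]

print(out_of_scope(changed, allowed))  # prints ['src/billing/invoice.py']
```

The billing file may be perfectly good code; it is flagged only because the written task never mentioned it, which is the kind of finding a diff-only review has no reference for.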

What is a machine-checkable specification?

A task written so its fulfillment can be checked mechanically: a goal, explicit boundaries for what may be touched, yes/no acceptance criteria, and a validation plan. 'Handles empty input by returning an empty list' is checkable; 'handles edge cases gracefully' is a hope.
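
One way to picture such a specification is as a small record with the four parts named above. The sketch below is an assumption about shape, not a standard format; the class name, field names and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    goal: str
    may_touch: list[str]            # explicit boundaries
    acceptance_criteria: list[str]  # each answerable with yes or no
    validation_plan: list[str]      # checks that will be run afterwards

    def problems(self) -> list[str]:
        """List structural gaps that would leave the task uncheckable."""
        issues = []
        if not self.goal.strip():
            issues.append("goal is empty")
        if not self.may_touch:
            issues.append("no boundaries given")
        if not self.acceptance_criteria:
            issues.append("no acceptance criteria")
        if not self.validation_plan:
            issues.append("no validation plan")
        return issues


# Hypothetical example task.
task = Task(
    goal="parse_items returns an empty list for empty input",
    may_touch=["src/parser/items.py", "tests/parser/test_items.py"],
    acceptance_criteria=[
        "parse_items('') returns []",
        "existing parser tests still pass",
    ],
    validation_plan=["pytest tests/parser", "mypy src/parser"],
)
print(task.problems())  # prints []
```

The structural check only confirms that each part is present; whether a criterion is truly binary still needs a human eye.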

How do I write a good task for an AI agent?

Three minutes of structure beat thirty minutes of prompt prose: state the goal in one sentence, list the files or areas the agent may touch, write two to five binary acceptance criteria, and name the checks that will validate the result. This single habit improves the generation and, at the same time, makes the output verifiable.

Should AI agents be allowed to commit on their own?

Not to shared, durable state - protected branches, production systems, databases. Inside disposable sandboxes and local working branches, auto-apply is fine and useful. The 2025 Replit incident, where an agent deleted a production database during an explicit code freeze, made the principle concrete: instructions are probabilistic, permissions are deterministic.
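
The difference between an instruction and a permission can be shown with a small decision function. This is a sketch under stated assumptions: the branch names and the `actor` and `sandboxed` parameters are hypothetical, and in practice the rule belongs in branch protection and credentials rather than in application code.

```python
# Hypothetical set of protected branch names.
PROTECTED_BRANCHES = {"main", "release"}


def may_auto_apply(actor: str, branch: str, sandboxed: bool) -> bool:
    """Decide whether a write may proceed without a human decision."""
    if actor != "agent":
        return True   # human writes are governed by other rules
    if sandboxed:
        return True   # disposable state: auto-apply is fine
    return branch not in PROTECTED_BRANCHES


print(may_auto_apply("agent", "main", sandboxed=False))           # prints False
print(may_auto_apply("agent", "feature/tokens", sandboxed=False))  # prints True
```

The function returns the same answer every time for the same inputs, no matter what the prompt said, which is the property an instruction alone cannot offer.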

Is it enough when the AI reviews its own code?

It is a useful pre-filter, not a verification verdict. Research shows LLM evaluators recognize and favor their own generations, so a model reviewing its own output checks its blind spots with the same blind spots. Independence rises with distance: a fresh pass, a different model, and above all a different reference - the written task.

What is an evidence report?

A per-run record of what was intended, what changed, what was validated with which results, what was deliberately skipped and what remains uncertain - stored with the code. It turns 'trust me, I checked' into a document a reviewer or auditor can read.
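
As an illustration, such a record could be a small structure serialized to JSON and committed next to the change. The field names and example values below are assumptions made for this sketch, not a defined format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvidenceReport:
    intended: str                # the written task, or a reference to it
    changed_files: list[str]
    validations: dict[str, str]  # check name -> result
    skipped: list[str]           # deliberately not checked
    uncertain: list[str]         # open questions for the reviewer


# Hypothetical run, for illustration only.
report = EvidenceReport(
    intended="parse_items returns an empty list for empty input",
    changed_files=["src/parser/items.py", "tests/parser/test_items.py"],
    validations={"pytest tests/parser": "passed", "mypy src/parser": "passed"},
    skipped=["performance benchmark"],
    uncertain=["behavior for whitespace-only input"],
)

# Serialize so the record can be stored with the code.
text = json.dumps(asdict(report), indent=2)
print(text)
```

Keeping the skipped and uncertain lists explicit is what separates a record like this from a success summary.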

How do you prove to an auditor what an AI tool did?

With per-change records answering five questions: which tool and version acted, on what written mandate, what changed, what was validated with which result, and who approved. Git natively answers only the middle question - the rest needs an evidence practice, which is cheap as a byproduct and impossible as archaeology.

How do you measure verification debt?

Four metrics computable from git and PR data: the generation-to-verification ratio, review depth (attention per changed line), the unverified-merge rate, and two-week churn. Throughput metrics stay green while all four deteriorate - that is why teams watching only velocity get surprised.
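
Two of these metrics can be sketched from per-PR records. The exact definitions are not given here, so the ones below are assumptions: a merge counts as unverified when it lacks passing checks or a human approval, and review depth is taken as review minutes per changed line. The record fields and sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MergedPR:
    lines_changed: int
    review_minutes: float
    checks_passed: bool
    human_approved: bool


def unverified_merge_rate(prs):
    """Share of merged PRs lacking passing checks or a human approval (assumed definition)."""
    if not prs:
        return 0.0
    unverified = sum(
        1 for pr in prs if not (pr.checks_passed and pr.human_approved)
    )
    return unverified / len(prs)


def review_depth(prs):
    """Review minutes per changed line across all PRs (assumed definition)."""
    total_lines = sum(pr.lines_changed for pr in prs)
    if total_lines == 0:
        return 0.0
    return sum(pr.review_minutes for pr in prs) / total_lines


# Hypothetical sample data.
prs = [
    MergedPR(100, 20, True, True),
    MergedPR(400, 5, True, False),
    MergedPR(500, 25, False, True),
    MergedPR(1000, 50, True, True),
]
print(unverified_merge_rate(prs))  # prints 0.5
print(review_depth(prs))           # prints 0.05
```

Tracked over time, the trend matters more than any single value: four merged PRs look like healthy throughput while half of them carried no complete verification.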

What does unverified AI code cost?

The clearest published signal is churn: GitClear's analysis of 211 million changed lines shows code reworked within two weeks of merge rising toward 5.7% as AI assistance grows, roughly double the pre-AI baseline. Skipped verification returns later as scheduled work - with interest, and usually attributed to other causes.

Is vibe coding acceptable for professional teams?

As a prototyping mode, often yes; as a production default, the data says no. The trade is explicit: maximum generation speed against minimal verification, and the deferred bill arrives as churn, security findings around the 45-55% failure band, and fix cycles. The dividing line is not whether AI writes the code but whether anything checks it before it carries production risk.

What data do AI coding tools send to their providers?

Depending on tool and configuration: the edited file, repository context the tool selects, pasted logs and tickets, telemetry - and with codebase-indexing tools, a representation of the whole repository. Secrets amplify the risk: AI-assisted commits leak credentials at roughly twice the baseline rate per GitGuardian's data.

How do you use AI coding tools in a GDPR-compliant way?

Treat it as a data-flow problem: map what leaves your environment, locate personal data in it (fixtures, tickets, logs - rarely the code itself), minimize before transmitting, and close the formal side - an Art. 28 processing agreement, a transfer mechanism for non-EU processing, documented training opt-outs. The final assessment belongs to your data protection officer.

Does the EU AI Act regulate AI-generated code?

No - the Act regulates AI systems, not the ordinary software they help write. Teams using coding assistants are deployers with a small duty set (AI literacy, softened by the 2026 omnibus), and the high-risk deadlines moved to December 2027 and August 2028. The duty to verify generated code lives elsewhere: product liability, sector rules and your own risk.

Who is liable when AI-generated code causes damage?

There is no special AI liability regime - the European Commission withdrew its proposed AI Liability Directive in 2025. The ordinary layers apply: contract and, from December 2026, the new Product Liability Directive treating software as a product; internally, management oversight duties that NIS2 sharpened. No published court case on AI-generated code exists yet - documented verification is the defense posture either way.

Are AI coding tools compatible with ISO 27001 or TISAX?

In principle yes - neither standard bans a tool category. Both demand the tools live inside your management system: inventoried, risk-assessed, under supplier control, with data classification honored and generated code verified. Certified organizations fail audits on undocumented AI usage, not on AI usage.

May regulated industries use AI coding tools at all?

Generally yes, because DORA, IEC 62304/MDR and ISO 26262/ASPICE regulate the process and its evidence, not the authorship of code. The same verification, traceability and documented lifecycle apply whether a human or an assistant wrote the change - AI raises the volume, which makes the evidence side harder and more important.

Can you run code review with a local LLM?

Yes - workable from about 8 GB of VRAM with 7B-class coder models, serious at 24 GB where quantized 32B-class models run. What makes smaller models punch above their weight is structure: a written task as the review reference plus deterministic checks, which lose nothing offline. The honest trade: you give up some peak insight in exchange for a hard data boundary.
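
A back-of-the-envelope calculation shows where those VRAM figures come from. The sketch assumes weights dominate memory and that "quantized" means roughly 4 bits per weight; real quantization formats use somewhat more, and the KV cache and runtime overhead come on top, so treat the results as lower bounds.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for name, params, bits in [
    ("7B at 4-bit", 7, 4),
    ("7B at 16-bit", 7, 16),
    ("32B at 4-bit", 32, 4),
    ("32B at 16-bit", 32, 16),
]:
    print(f"{name}: {weight_gb(params, bits):.1f} GB")
# prints:
# 7B at 4-bit: 3.5 GB
# 7B at 16-bit: 14.0 GB
# 32B at 4-bit: 16.0 GB
# 32B at 16-bit: 64.0 GB
```

Under these assumptions a 4-bit 7B model leaves headroom on an 8 GB card and a 4-bit 32B model fits within 24 GB, while the unquantized versions do not.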

Does AI code verification work without internet access?

Yes - verification is the offline-friendly half of AI coding. The deterministic layer (build, types, tests, scanners) is offline-native, the written task is a file in the repository, and the judgment layer runs on a local model. What an air gap constrains is the generator, not the checking.

Which AI code review tools exist in 2026?

Four groups: dedicated PR reviewers (CodeRabbit, Greptile, Qodo), static+AI platforms (DeepSource, SonarQube with AI Code Assurance), assistant-integrated reviewers (Copilot code review, Cursor Bugbot, Claude Code), and local approaches (open-source PR-Agent, local models, verification layers). There is no universal best - the right pick depends on data constraints and on which question needs answering: diff quality or task conformance.

Does one verification process work across multiple AI tools?

Yes, if the invariants live one layer above the tools: a written task per run, a change-against-task check, validation the model did not author, evidence per run, and a human gate. These five steps mention no tool, which is why they survive tool switches and cover the agent someone adopted yesterday.
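
Because the five steps name no tool, they can be checked with a tool-agnostic gate. A minimal sketch, assuming each run is summarized as a dictionary of booleans; the key names are hypothetical.

```python
# Hypothetical keys, one per invariant listed above.
REQUIRED = (
    "written_task",
    "change_matches_task",
    "independent_validation",
    "evidence_recorded",
    "human_approved",
)


def missing_invariants(run: dict) -> list[str]:
    """Return the invariants a run has not satisfied, in a fixed order."""
    return [name for name in REQUIRED if not run.get(name, False)]


# A run from any assistant; nothing in the record says which one.
run = {
    "written_task": True,
    "change_matches_task": True,
    "independent_validation": True,
    "evidence_recorded": False,
}
print(missing_invariants(run))  # prints ['evidence_recorded', 'human_approved']
```

A missing key counts as unsatisfied, so a newly adopted tool that reports nothing fails the gate by default rather than passing it.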


The AI Code Verification FAQ

Last updated: 2026-07-02 (8 min read)

These are the 25 most common questions about AI code verification – concepts, methods, law, tooling – each answered in two to four self-contained sentences consistent with the deep-dive articles behind them. Statistics cited here carry their year; sources and full arguments are one link away in the respective article.


How this FAQ is organized

The 25 questions below follow the arc most teams walk: what the problem is (verification debt, the gap, the bottleneck), how to work differently (tasks, verification, evidence), what the rules say (GDPR, AI Act, liability, certifications), and what the tooling landscape offers (local models, offline, reviewers, multi-tool). Every answer stands alone by design – quote freely, and follow the map below into the deep dives.

Questions	Cluster	Deep dives
1-4	The problem, measured	Concepts category
5-10	Working differently	Methods + Governance categories
11-15	Evidence and economics	Evidence + Cost categories
16-21	Rules and certifications	Compliance category
22-25	Tooling and architecture	Local-first + Comparisons + Works-with

The question clusters and where their deep dives live - the FAQ answers in four sentences what the articles argue in fourteen hundred words.

Two companions to this page: the glossary defines the terms these answers use, and the article hub holds every deep dive by category. Like the glossary, this FAQ is a living reference – new categories add their questions here in the same change.

Where Reality Graph fits

The recurring answer pattern above - written task, verification against it, evidence, human gate - is what Reality Graph mechanizes, local-first. The FAQ stays deliberately wider than the product: most answers hold with or without any particular tool, which is exactly why they are safe to quote.

This FAQ gives you

25 self-contained answers, each quotable alone
Consistency with the sourced deep-dive articles
The cluster map into the full article base
A living reference that grows with new categories

It does not give you

The full arguments - those live in the linked articles
Legal advice - the compliance answers are descriptive
Vendor verdicts - see the comparison category
Product documentation - this page is about the sector


