The AI Coding Verification Glossary
Last updated: 2026-07-02 · 8 min read
This AI coding verification glossary defines the sector’s vocabulary in one to three self-contained sentences per term – from verification debt and the review bottleneck through checkable specifications and evidence reports to slopsquatting and the CLOUD Act. Each definition stands alone and links to its deep dive. It is a living reference, extended as the article base grows.
The core problem terms
Verification debt. The growing gap between how fast AI tools generate code and how reliably a team verifies it before merge. It accumulates silently while throughput metrics look excellent and surfaces later as rework, incidents and lost trust – definition, data and countermeasures.
Verification gap. The measured distance between distrust and diligence: 96% of developers distrust AI code while only 48% consistently verify it (Sonar, 2026). The gap is behavioral, not technical – the checks exist, they are skipped.
Review bottleneck. The capacity mismatch that appears when AI generates code faster than humans read it: merged PRs nearly double while review time per PR rises 91%. The delivery constraint moved from writing code to verifying it – mechanics and numbers.
Technical debt. Deliberately or negligently deferred code quality that costs interest as future change effort. The classic sibling of the newer debt terms below – distinct because the code works and everyone knows the shortcut was taken.
Comprehension debt. The gap between the code a team ships and the mental model its people hold of it. Rooted in Naur’s theory-building view of programming and accelerated by AI generation – origins and consequences.
Vibe coding. Prompting an AI and shipping the result largely without line-by-line review – coined by Andrej Karpathy in 2025 as a description of prototype-mode work. Rational for throwaway code, expensive as a production default – the measured bill.
Rubber-stamping. Approving changes without genuine scrutiny – review as ritual. The typical failure mode of teams whose review capacity is outrun by AI generation volume; approvals keep flowing, correctness ships on trust.
Code churn. The share of code revised or reverted shortly after merge – commonly measured in a two-week window. GitClear’s 211-million-line analysis shows churn rising toward 5.7% as AI assistance grows – a lagging signal that verification was skipped.
Scope creep (AI runs). An agent changing more than the task asked – extra files, unrequested refactors, drive-by fixes. Well-formed code outside the mandate passes quality-only review, which is why boundaries and scope checks exist.
| Term | What is deferred | How it surfaces |
|---|---|---|
| Technical debt | Code quality | Rising change effort |
| Verification debt | Checking | Rework, incidents, lost trust |
| Comprehension debt | Understanding | Nobody can safely change the code |
| Documentation debt | Recording | Onboarding and audits stall |
Method terms
AI coding verification. Checking AI-generated changes against explicit task intent, validation plans and evidence before acceptance – the umbrella method this site describes, step by step.
Spec-vs-implementation check. Verifying a change against a written statement of intent instead of the reviewer’s memory of the prompt. It solves the circularity of judging code by looking only at code – the five steps.
Machine-checkable specification. A task written so its fulfillment can be checked mechanically: goal, boundaries, yes/no acceptance criteria, validation plan – the four building blocks.
Acceptance criteria. Binary conditions a change must satisfy to fulfill its task. “Handles empty input by returning an empty list” is checkable; “handles edge cases well” is not.
Task boundaries. The declared perimeter of an AI run – which files, directories and systems it may touch. Declared before the run, checkable after it; the mechanical answer to scope creep.
Validation plan. The pre-declared list of checks that will decide whether a change is accepted – tests, types, build, spec comparison. Declaring it before generation prevents validating whatever happens to pass.
Two-pass review. A machine pre-check clearing the mechanical layer, followed by human review focused on architecture and trade-offs – what belongs in each pass.
Spec-driven development (SDD). Writing the specification first and generating code from it – tooling like Spec Kit and Kiro popularized the pattern for AI agents. Specs still need verification against the result – honest assessment.
Session handoff. Persisting state, decisions and open verification points in an artifact when one AI session ends and another begins – because context windows forget – method and template.
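The four building blocks of a machine-checkable specification, and the scope check that task boundaries make possible, can be sketched as a small data structure. This is an illustrative sketch only: `TaskSpec`, `out_of_scope`, the glob-pattern representation of boundaries and the file paths are all hypothetical names chosen here, not a format this site or any tool prescribes.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class TaskSpec:
    """Hypothetical task record: goal, boundaries, criteria, validation plan."""
    goal: str
    allowed_paths: list[str]        # task boundaries, expressed here as glob patterns
    acceptance_criteria: list[str]  # each one phrased as a yes/no condition
    validation_plan: list[str]      # checks declared before generation starts


def out_of_scope(spec: TaskSpec, changed_files: list[str]) -> list[str]:
    """Return the changed files that match none of the declared boundary patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in spec.allowed_paths)
    ]


spec = TaskSpec(
    goal="Return an empty list when the parser receives empty input",
    allowed_paths=["src/parser/*", "tests/parser/*"],
    acceptance_criteria=["Handles empty input by returning an empty list"],
    validation_plan=["unit tests", "type check", "build", "spec comparison"],
)

# Files an agent run touched (made-up example data).
changed = [
    "src/parser/tokens.py",
    "tests/parser/test_tokens.py",
    "src/billing/invoice.py",
]

violations = out_of_scope(spec, changed)
print(violations)  # ['src/billing/invoice.py']
```

Because the boundaries are declared before the run, the check after the run is a plain set comparison and needs no judgment about code quality.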
Measurement terms
Generation-to-verification ratio. AI-assisted lines merged versus lines covered by real verification – the top-level verification-debt metric, computable from git and PR data.
Review depth. Reviewer attention per changed line (comments, time). Falling depth at rising volume is the measurable signature of rubber-stamping.
Unverified-merge rate. The share of changes merged without any validation that was authored independently of the generating model. The single most direct debt indicator – formulas and thresholds.
Two-week churn. The share of merged code reworked within 14 days – cheap to compute, hard to game, and the lagging confirmation of the three leading metrics above.
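Three of the metrics above reduce to simple ratios once per-change data is available. The following is a minimal sketch under stated assumptions: the `MergedChange` record and its field names are hypothetical, the numbers are made up, and how a team decides that a line is "covered by real verification" is left open. Review depth is omitted because it needs reviewer activity data rather than line counts.

```python
from dataclasses import dataclass


@dataclass
class MergedChange:
    """Hypothetical per-change record assembled from git and PR data."""
    ai_lines: int                  # AI-assisted lines merged
    verified_lines: int            # of those, lines covered by real verification
    independently_validated: bool  # validation not authored by the generating model
    total_lines: int               # all lines merged in this change
    reworked_lines_14d: int        # lines revised or reverted within 14 days


def generation_to_verification_ratio(changes: list[MergedChange]) -> float:
    """AI-assisted lines merged divided by lines covered by verification."""
    generated = sum(c.ai_lines for c in changes)
    verified = sum(c.verified_lines for c in changes)
    return generated / verified if verified else float("inf")


def unverified_merge_rate(changes: list[MergedChange]) -> float:
    """Share of changes merged without independently authored validation."""
    if not changes:
        return 0.0
    unverified = sum(1 for c in changes if not c.independently_validated)
    return unverified / len(changes)


def two_week_churn(changes: list[MergedChange]) -> float:
    """Share of merged lines reworked within 14 days of merge."""
    merged = sum(c.total_lines for c in changes)
    reworked = sum(c.reworked_lines_14d for c in changes)
    return reworked / merged if merged else 0.0


changes = [
    MergedChange(ai_lines=400, verified_lines=200, independently_validated=True,
                 total_lines=500, reworked_lines_14d=10),
    MergedChange(ai_lines=600, verified_lines=300, independently_validated=False,
                 total_lines=500, reworked_lines_14d=40),
]

print(generation_to_verification_ratio(changes))  # 2.0
print(unverified_merge_rate(changes))             # 0.5
print(two_week_churn(changes))                    # 0.05
```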
Evidence and governance terms
Evidence report. A per-run record of what was intended, changed, validated (with results), skipped and left uncertain – stored with the code – structure and sample.
Audit trail (AI code). The accumulated per-change records answering who/what acted, on what mandate, what changed, what was checked, who approved. Git alone answers only the middle question – the five auditor questions.
Provenance. Knowing where a change came from – which tool, model, version and mandate. The precondition for incident attribution and license/audit answers in AI-assisted codebases.
Human gate. The point where a person reads the evidence and decides whether a change enters shared state. The one step no automation replaces in a sound AI workflow.
AI coding policy. The team document fixing tools, data rules, task rules, verification duties, evidence, agent permissions, exceptions and ownership – full template.
Advisory by default. An agent posture where the tool proposes and never applies unattended – the architectural form of “the human decides”.
No auto-commit. The rule that AI agents never write unattended to shared, durable state (protected branches, production). Least privilege applied to agents – the permission ladder.
Least privilege (agents). Granting an agent only the access its task needs – scoped credentials, no production reach. Deterministic safety that holds when instructions fail.
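The evidence report defined above lists five things a per-run record holds: what was intended, changed, validated (with results), skipped and left uncertain. One possible shape for such a record is sketched below. The class names, field names and sample values are assumptions made for illustration, not the structure used in the linked deep dive.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CheckResult:
    """One validation step and its outcome."""
    name: str
    passed: bool


@dataclass
class EvidenceReport:
    """Hypothetical per-run record, stored alongside the code."""
    intended: str
    changed_files: list[str]
    validated: list[CheckResult]
    skipped: list[str] = field(default_factory=list)
    uncertain: list[str] = field(default_factory=list)

    def open_points(self) -> list[str]:
        """Everything the person at the human gate should look at first."""
        failed = [f"failed: {c.name}" for c in self.validated if not c.passed]
        skipped = [f"skipped: {s}" for s in self.skipped]
        uncertain = [f"uncertain: {u}" for u in self.uncertain]
        return failed + skipped + uncertain


report = EvidenceReport(
    intended="Return an empty list when the parser receives empty input",
    changed_files=["src/parser/tokens.py", "tests/parser/test_tokens.py"],
    validated=[CheckResult("unit tests", True), CheckResult("type check", False)],
    skipped=["integration tests"],
    uncertain=["behaviour on malformed input"],
)

# Serialize so the record can be committed or attached to the change.
print(json.dumps(asdict(report), indent=2))
print(report.open_points())
```

The sketch only surfaces open points; whether the change is accepted remains the reviewer's decision at the human gate.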
Independence and failure-mode terms
Independent verification. Checking AI output with an instance that did not produce it – ideally a different model and always a different reference (the written task, not the diff) – the independence ladder.
Self-preference bias. The measured tendency of LLM evaluators to recognize and favor their own generations – the research core of why same-model self-review has systematic blind spots.
LLM-as-a-judge. Using a language model to evaluate outputs (including other models’ code). Useful and scalable, with known biases – self-preference above all – that independence measures counteract.
Cross-model review. Having a different model than the generator review a change. This removes shared blind spots, but one limit remains: if the reviewer sees only the diff, the diff is still its own reference.
Hallucinated API. A plausible-looking function, parameter or endpoint the model invented. Typed stacks catch many at build time; dynamic ones ship them – the five failure classes.
Package hallucination / slopsquatting. Models inventing dependency names – and attackers registering those names with malicious payloads (slopsquatting). A supply-chain risk unique to generated code; internal mirrors and dependency policies blunt it.
Self-confirming tests. Tests written by the same model that wrote the code, asserting the implemented behavior rather than the required one. They pass while the requirement fails – why validation authorship matters.
Plausible-but-wrong logic. Clean, idiomatic code implementing the wrong rule. Invisible to quality-only review because the defect is relative to the task, which is not in the diff.
Prompt injection (coding agents). Adversarial instructions hidden in material an agent reads (code comments, issues, web content) that redirect its behavior. One more reason agent permissions, not agent obedience, carry the safety.
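Self-confirming tests and plausible-but-wrong logic, both defined above, are easiest to see at a boundary value. The example below is invented for illustration: the task ("10% discount on orders of 100 or more"), the function and both tests are hypothetical. The implementation is clean but uses `>` where the task says "or more".

```python
def discount(order_total: int) -> int:
    """Generated implementation of the hypothetical task
    '10% discount on orders of 100 or more'."""
    if order_total > 100:  # plausible-but-wrong: the task says "100 or more"
        return order_total // 10
    return 0


def self_confirming_test() -> bool:
    """Derived from the implementation: asserts what the code does at the boundary."""
    return discount(100) == 0


def spec_derived_test() -> bool:
    """Derived from the task text: asserts what the requirement says at the boundary."""
    return discount(100) == 10


print(self_confirming_test())  # True: the test passes while the requirement fails
print(spec_derived_test())     # False: the defect shows once the task is the reference
```

Nothing in the diff marks the first test as wrong. Only the written task does, which is why the definitions above stress who authors the validation and what it is checked against.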
Architecture and deployment terms
Local-first. An architecture where processing happens inside your environment by design, so no vendor data flow exists to analyze – as distinct from a local process calling cloud APIs – who needs it and how.
Local model. An LLM running on your own hardware – workable for code review from roughly 8 GB VRAM, serious at 24 GB – tiers and limits.
Air-gapped. An environment with no network connection to the outside. Verification works there natively; generation is constrained to local models – what works offline.
Data boundary. The declared line data must not cross – environment, jurisdiction, provider. The organizing concept of the local-first category.
Codebase index. A tool-maintained representation of your whole repository (embeddings, graphs) enabling context-aware review – and widening the data question from diffs to everything.
Sandboxing (agents). Running agent work in disposable, isolated environments so mistakes cannot reach shared state. Containment for the run – distinct from verification of the result.
Deployer / provider (EU AI Act). The AI Act’s role split: providers place AI systems on the market, deployers use them professionally. Teams using coding assistants are deployers, with a correspondingly small duty set – what applies.
CLOUD Act. The 2018 US law letting authorities compel providers under US jurisdiction to produce data they control, regardless of storage location – the sober analysis.
Zero data retention. A vendor commitment not to store customer inputs beyond processing. Valuable where real – and worth verifying against the vendor’s own documentation, which sometimes says otherwise.
Where Reality Graph fits
Reality Graph operationalizes a specific subset of this vocabulary: machine-checkable tasks with boundaries, spec-vs-implementation verification, evidence reports, advisory-by-default and no auto-commit, all in a local-first architecture. The glossary stays deliberately wider than the product: the vocabulary belongs to the sector, and the FAQ answers the questions these terms raise.
This glossary gives you
- 40+ terms, each defined to stand alone
- Links to the sourced deep dive per term
- The debt-family distinctions people conflate
- A living reference that grows with the article base
It does not give you
- The statistics themselves - cite the linked sources
- Legal definitions - the compliance articles carry those
- Vendor evaluations - see the comparison category
- Canonical authority over contested terms - meanings are noted as used here
FAQ
- What is the most important term to understand first?
- Verification debt - the growing gap between how fast AI tools generate code and how reliably teams verify it before merge. Most other terms in this glossary describe either a cause of that gap (review bottleneck, vibe coding), a way to measure it (generation-to-verification ratio, unverified-merge rate), or a way to close it (spec-vs-implementation check, evidence reports).
- How do the four 'debt' terms differ?
- Technical debt is deliberately deferred code quality. Verification debt is deferred checking - code shipped faster than it was verified. Comprehension debt is deferred understanding - code nobody on the team holds a mental model of. Documentation debt is deferred recording. They compound: unverified code tends to be un-understood code, and both get worse silently.
- Are these standard industry terms or your inventions?
- Mixed, and the glossary marks the difference. Terms like technical debt, vibe coding, slopsquatting, LLM-as-a-judge and prompt injection are established sector vocabulary with known origins. Terms like verification debt and comprehension debt circulate in the practitioner debate without a single canonical source - we define the meaning used across this site and link the sourced deep dives.
- How is this glossary maintained?
- It is a living reference: whenever a new article category ships, new terms are added in the same change and the visible date is updated. Definitions are kept consistent with the deep-dive articles - where a term's article says more, the article wins and the glossary links to it.
- Can I cite these definitions?
- Yes - each definition is written to stand alone, which is the point of the page. For statistics behind terms like the verification gap, cite the underlying source (named in the linked deep dives) rather than the glossary line.