Measuring Verification Debt

Last updated: 2026-07-02 · 4 min read

You measure verification debt as the gap between how much code a team merges and how much of it is demonstrably verified – tracked with four computable metrics: generation-to-verification ratio, review depth, unverified-merge rate, and two-week churn. Three of them come straight from git and PR data; the unverified-merge rate additionally needs a verification record per change. The trend over months matters more than any single reading.

Why your current dashboards don't show the debt

Delivery metrics improve while the debt grows – that is what makes it invisible. AI adoption lifts deployment frequency and lead time, and the risk moves into dimensions the classic dashboards were never built to watch. The 2025 DORA report added rework rate as a fifth metric for exactly this blind spot, and industry telemetry shows median PR review time up 441% and PR size up 51% as AI volume grows.

This article turns the five warning signals from the verification debt overview into something you can put in a spreadsheet this week: formulas, data sources, and honest starting thresholds.

The four metrics, with formulas

Metric	Formula	Data source	Warning threshold (starting point)
Generation-to-verification ratio (GVR)	Merged changed lines per week ÷ reviewer hours actually spent	git log --stat; calendar or PR timestamps	Ratio doubles while reviewer hours stay flat
Review depth	Substantive review comments ÷ (changed lines ÷ 100)	PR platform API (exclude bot and nitpick comments)	Falling below ~1 comment per 100 lines while AI volume rises
Unverified-merge rate (UMR)	AI-assisted merges without recorded validation evidence ÷ all AI-assisted merges	Verification records / evidence reports per change	Above 30% – only 48% of Sonar's respondents say they always verify AI code
Two-week churn	Lines revised or reverted ≤ 14 days after merge ÷ new lines	git log with follow-up commit analysis	Above ~6% – GitClear measured the industry drifting from 3.1% to 5.7%

Four verification debt metrics computable from git, PR data, and per-change verification records. The thresholds are starting points, not standards - calibrate against your own three-month baseline.
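The "merged changed lines" input in the table can be collected straight from git. The following is a minimal sketch, not a definitive implementation: it assumes `git log --numstat` output, counts added plus deleted lines, and skips binary files. The function names and the `repo`, `since`, and `until` parameters are illustrative; which commits count as "merged" depends on your branching and merge strategy.

```python
import subprocess


def changed_lines_from_numstat(numstat_text: str) -> int:
    """Sum added + deleted lines from `git log --numstat` output."""
    total = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # blank lines or anything that is not a numstat row
        added, deleted, _path = parts
        if not (added.isdigit() and deleted.isdigit()):
            continue  # binary files are reported as "-" instead of a count
        total += int(added) + int(deleted)
    return total


def merged_changed_lines(repo: str, since: str, until: str) -> int:
    """Changed lines on the current branch of `repo` between two dates.

    Assumption: the checked-out branch is the one you merge into.
    `--format=` suppresses commit headers so only numstat rows remain.
    """
    result = subprocess.run(
        ["git", "-C", repo, "log", "--numstat", "--format=",
         f"--since={since}", f"--until={until}"],
        capture_output=True, text=True, check=True,
    )
    return changed_lines_from_numstat(result.stdout)
```

Keeping the parser separate from the git call makes it easy to check against a pasted sample before pointing it at a real repository.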

Two published anchors help you place your numbers. GitClear’s analysis of 211 million changed lines shows two-week churn rising from 3.1% (2020) toward 5.7%, with copy-pasted lines overtaking refactored lines for the first time. Sonar’s survey puts the share of developers who always verify AI code at 48%, which makes a naive team-level UMR of ~50% a realistic, sobering default.

A worked example

A fictional eight-person team, one month of data – labeled as an example, but the arithmetic is the arithmetic:

vd-report-april.md

Example – illustrative numbers

Team:       8 devs · ~60% of merges AI-assisted
Window:     4 weeks

Inputs
  merged changed lines:        41,200   (was ~19,000/mo pre-AI)
  reviewer hours (calendar):   64 h     (unchanged from pre-AI)
  substantive PR comments:     212
  AI-assisted merges:          97 · with recorded evidence: 31
  lines revised ≤14 days:      2,890

Metrics
  GVR:            41,200 / 64  = 644 lines/reviewer-hour
                  (pre-AI baseline: 297)  → 2.2x  ⚠
  Review depth:   212 / 412    = 0.51 per 100 lines  ⚠
  UMR:            (97-31)/97   = 68%                 ⚠⚠
  2-week churn:   2,890/41,200 = 7.0%                ⚠

Reading: generation doubled, verification capacity didn't.
Priority: record verification per change (UMR is the lever
that moves first when the process changes).
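The arithmetic in the report can be reproduced in a few lines. This is a sketch under stated assumptions: the class, field, and function names are made up for illustration, the thresholds are the starting points from the table rather than standards, and, like the report, it uses merged changed lines as the churn denominator.

```python
from dataclasses import dataclass


@dataclass
class MonthInputs:
    merged_changed_lines: int
    reviewer_hours: float
    substantive_comments: int
    ai_merges: int
    ai_merges_with_evidence: int
    lines_revised_14d: int


def compute_metrics(m: MonthInputs) -> dict[str, float]:
    """Apply the four formulas; assumes all denominators are non-zero."""
    return {
        "gvr": m.merged_changed_lines / m.reviewer_hours,
        "review_depth": m.substantive_comments / (m.merged_changed_lines / 100),
        "umr": (m.ai_merges - m.ai_merges_with_evidence) / m.ai_merges,
        "churn": m.lines_revised_14d / m.merged_changed_lines,
    }


def starting_point_flags(metrics: dict[str, float], baseline_gvr: float) -> list[str]:
    """Return the metrics that cross the table's starting thresholds."""
    flags = []
    if metrics["gvr"] >= 2 * baseline_gvr:  # ratio has doubled vs. baseline
        flags.append("gvr")
    if metrics["review_depth"] < 1.0:       # under ~1 comment per 100 lines
        flags.append("review_depth")
    if metrics["umr"] > 0.30:
        flags.append("umr")
    if metrics["churn"] > 0.06:
        flags.append("churn")
    return flags


if __name__ == "__main__":
    april = MonthInputs(41_200, 64, 212, 97, 31, 2_890)
    metrics = compute_metrics(april)
    baseline_gvr = 19_000 / 64  # pre-AI month with the same reviewer hours
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
    print("flags:", starting_point_flags(metrics, baseline_gvr))
```

Replace the fixed thresholds with values derived from your own three-month baseline once you have one.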

Measurement pitfalls

Goodhart’s law. The moment a metric becomes a target, it stops measuring. These four are diagnostic instruments for the monthly retro – not OKRs, and never individual performance scores.
Attribution noise. Cleanly separating AI-assisted from human changes is hard; a simple PR label set by the author is imperfect but beats guessing. Consistency matters more than precision.
Small numbers. A three-person team’s monthly churn jumps around. Widen the window before drawing conclusions.
Measuring without changing anything. The metrics only pay off with a lever attached – written intent per task, a spec-vs-implementation check, recorded outcomes. Otherwise it is a dashboard of decline.
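For the small-numbers pitfall, one way to widen the window is to pool raw counts over several months instead of averaging monthly percentages, so a quiet month with few new lines cannot dominate the reading. A minimal sketch follows; the function name and the three-month default are assumptions chosen to match the three-month baseline suggested above.

```python
def pooled_churn(months: list[tuple[int, int]], window: int = 3) -> list[float]:
    """Rolling churn over `window` months, pooled by raw line counts.

    `months` holds (lines_revised_within_14_days, new_lines) per month,
    oldest first. One value is returned for each month that has a full
    window behind it; a window with no new lines yields NaN.
    """
    pooled = []
    for end in range(window, len(months) + 1):
        chunk = months[end - window:end]
        revised = sum(r for r, _ in chunk)
        new = sum(n for _, n in chunk)
        pooled.append(revised / new if new else float("nan"))
    return pooled


if __name__ == "__main__":
    # Monthly readings of 2.5%, 15%, 2%, and 10% look alarming one at a time.
    history = [(10, 400), (90, 600), (20, 1000), (40, 400)]
    print([f"{value:.1%}" for value in pooled_churn(history)])
```

The same pooling applies to the unverified-merge rate when a team only merges a handful of AI-assisted changes per month.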

Where Reality Graph fits

The metric git cannot compute is whether a change was actually verified. Reality Graph produces that record as a by-product: every run ends in an evidence report, so the unverified-merge rate becomes a query instead of an archaeology project – and the other three metrics gain the denominator they need.
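Once a verification outcome is recorded per change, the unverified-merge rate reduces to a filter and a count. The sketch below assumes a hypothetical record shape: `ChangeRecord`, `ai_assisted`, and `evidence_ref` are illustrative names, not Reality Graph's actual report format, and the AI-assisted flag is assumed to come from an author-set PR label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeRecord:
    change_id: str
    ai_assisted: bool             # e.g. set from a PR label (assumption)
    evidence_ref: Optional[str]   # id or link of the recorded verification, if any


def unverified_merge_rate(records: list[ChangeRecord]) -> Optional[float]:
    """Share of AI-assisted merges without recorded evidence.

    Returns None when there are no AI-assisted merges in the window,
    because the rate is undefined rather than zero.
    """
    ai_merges = [r for r in records if r.ai_assisted]
    if not ai_merges:
        return None
    unverified = sum(1 for r in ai_merges if not r.evidence_ref)
    return unverified / len(ai_merges)
```

Returning `None` for an empty window keeps a month without AI-assisted merges from showing up as a perfect score.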

Measuring gives you

A trend line instead of a feeling that review is drowning
An argument budget-holders understand
Early warning before churn reaches production
A way to see whether process changes actually work

It does not

Come with industry-standard thresholds – calibrate locally
Attribute AI vs. human changes perfectly
Stay honest if used for individual performance reviews
Reduce the debt by itself – it points at the levers

FAQ

How do you measure verification debt concretely?

With four metrics computed from data you already have: the generation-to-verification ratio (merged changed lines vs. reviewer hours actually spent), review depth (substantive comments per 100 changed lines), the unverified-merge rate (share of AI-assisted merges without recorded validation evidence), and two-week churn (share of new code revised within 14 days). Track them monthly; the trend matters more than any single value.

Which metric should a team start with?

The unverified-merge rate, because it is the most direct expression of the debt: every merge without recorded verification is an unverified claim in production. It only requires one convention - that verification outcomes are recorded per change - and it immediately shows whether your process changes have traction.

Are there standard thresholds?

No standard exists yet - the field is too young. The thresholds in this article are starting points derived from published baselines, like GitClear's measured rise of two-week churn from about 3% to almost 6% across 211 million changed lines. Calibrate against your own three-month baseline rather than someone else's absolute number.

Do I need special tooling to measure this?

No. Git and your PR platform hold everything except verification records: changed lines, review comments, revision timestamps. A spreadsheet and a monthly retro are enough to start. Tooling helps with the one metric git cannot see - whether a change was actually verified - which is why evidence reports pay off twice.

Aren't DORA metrics enough?

DORA measures delivery performance, and AI lifts exactly those numbers - deployment frequency and lead time improve while the risk moves elsewhere. The 2025 DORA report added rework rate as a fifth metric for precisely this blind spot, and industry telemetry shows median PR review time up 441% as AI volume grows. Verification debt metrics complement DORA; they do not replace it.

Can these metrics be used to evaluate individual developers?

They should not be. Verification debt is a property of the system - task definitions, review capacity, tooling - not of individuals, and turning the metrics into personal scores invites gaming that destroys their diagnostic value. Measure the pipeline, discuss trends in the retro, and change the process rather than the ranking.
