Measuring Verification Debt

Last updated: 2026-07-02 · 4 min read

You measure verification debt as the gap between how much code a team merges and how much of it is demonstrably verified – tracked with four computable metrics: generation-to-verification ratio, review depth, unverified-merge rate, and two-week churn. All four come from git and PR data; the trend over months matters more than any single reading.

Why your current dashboards don't show the debt

Delivery metrics improve while the debt grows – that is what makes it invisible. AI adoption lifts deployment frequency and lead time, and the risk moves into dimensions the classic dashboards were never built to watch: the 2025 DORA report added rework rate as a fifth metric for exactly this blind spot, with industry telemetry showing median PR review time up 441% and PR size up 51% as AI volume grows.

This article turns the five warning signals from the verification debt overview into something you can put in a spreadsheet this week: formulas, data sources, and honest starting thresholds.

The four metrics, with formulas

| Metric | Formula | Data source | Warning threshold (starting point) |
|---|---|---|---|
| Generation-to-verification ratio (GVR) | Merged changed lines per week ÷ reviewer hours actually spent | git log --stat; calendar or PR timestamps | Ratio doubles while reviewer hours stay flat |
| Review depth | Substantive review comments ÷ (changed lines ÷ 100) | PR platform API (exclude bot and nitpick comments) | Falling below ~1 comment per 100 lines while AI volume rises |
| Unverified-merge rate (UMR) | AI-assisted merges without recorded validation evidence ÷ all AI-assisted merges | Verification records / evidence reports per change | Above 30% – roughly half of Sonar's respondents do not always verify AI code |
| Two-week churn | Lines revised or reverted ≤ 14 days after merge ÷ new lines | git log with follow-up commit analysis | Above ~6% – GitClear measured the industry drifting from 3.1% to 5.7% |

Four verification debt metrics computable from git and PR data. The thresholds are starting points, not standards – calibrate against your own three-month baseline.
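
The four formulas in the table reduce to a few divisions. Here is a minimal Python sketch, assuming the raw counts for one window have already been collected. The `WindowInputs` class and its field names are illustrative and not part of any tool mentioned in this article. As a simplification, churn uses merged changed lines as its denominator instead of a separate count of new lines.

```python
from dataclasses import dataclass


@dataclass
class WindowInputs:
    """Raw counts for one measurement window (hypothetical field names)."""
    merged_changed_lines: int
    reviewer_hours: float
    substantive_comments: int
    ai_merges: int
    ai_merges_with_evidence: int
    lines_revised_within_14d: int


def gvr(x: WindowInputs) -> float:
    # Generation-to-verification ratio: changed lines per reviewer hour.
    return x.merged_changed_lines / x.reviewer_hours


def review_depth(x: WindowInputs) -> float:
    # Substantive comments per 100 changed lines.
    return x.substantive_comments / (x.merged_changed_lines / 100)


def umr(x: WindowInputs) -> float:
    # Share of AI-assisted merges without recorded validation evidence.
    return (x.ai_merges - x.ai_merges_with_evidence) / x.ai_merges


def two_week_churn(x: WindowInputs) -> float:
    # Assumption: merged changed lines stand in for "new lines".
    return x.lines_revised_within_14d / x.merged_changed_lines


if __name__ == "__main__":
    sample = WindowInputs(41_200, 64, 212, 97, 31, 2_890)
    print(f"GVR:          {gvr(sample):.0f} lines/reviewer-hour")
    print(f"Review depth: {review_depth(sample):.2f} per 100 lines")
    print(f"UMR:          {umr(sample):.0%}")
    print(f"2-week churn: {two_week_churn(sample):.1%}")
```

Because numerator and denominator cover the same window, GVR comes out the same whether you feed in weekly or monthly totals, as long as both inputs use the same period.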

Two published anchors help you place your numbers. GitClear’s analysis of 211 million changed lines shows two-week churn rising from 3.1% (2020) toward 5.7%, with copy-pasted lines overtaking refactored lines for the first time. Sonar’s survey puts the share of developers who always verify AI code at 48%, which makes a naive team-level UMR of ~50% a realistic, sobering default.
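
Before any of these numbers can be compared, the changed-line input has to come out of git. The sketch below assumes the log was produced with `git log --numstat --pretty=format:'@%H %cI'`, a machine-readable relative of the `--stat` output named in the table. The `@` prefix, the function name, and the sample text are assumptions for illustration. Restricting the log to merged work on your main branch is left to the caller.

```python
from collections import defaultdict
from datetime import datetime


def changed_lines_per_week(numstat_output: str) -> dict[str, int]:
    """Sum added + deleted lines per ISO week from `git log --numstat` text."""
    totals: dict[str, int] = defaultdict(int)
    week = None
    for line in numstat_output.splitlines():
        if line.startswith("@"):
            # Header line from the assumed format: "@<hash> <ISO 8601 date>"
            _commit, iso_date = line[1:].split(" ", 1)
            year, week_no, _weekday = datetime.fromisoformat(iso_date).isocalendar()
            week = f"{year}-W{week_no:02d}"
        elif line.strip() and week is not None:
            # numstat line: "<added>\t<deleted>\t<path>"
            added, deleted, _path = line.split("\t", 2)
            if added != "-":  # binary files are reported as "-\t-"
                totals[week] += int(added) + int(deleted)
    return dict(totals)


# Hand-written sample in the assumed format (not real repository data).
SAMPLE = "\n".join([
    "@a1b2c3 2026-04-06T10:00:00+02:00",
    "120\t30\tsrc/app.py",
    "-\t-\tdocs/logo.png",
    "",
    "@d4e5f6 2026-04-08T09:30:00+02:00",
    "50\t0\tsrc/util.py",
    "",
    "@0a1b2c 2026-04-14T16:45:00+02:00",
    "10\t5\tREADME.md",
])

if __name__ == "__main__":
    for week, lines in sorted(changed_lines_per_week(SAMPLE).items()):
        print(week, lines)
```

Counting added plus deleted lines is one reasonable reading of "changed lines". Whichever definition you pick, keep it fixed across months so the trend stays comparable.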

A worked example

A fictional eight-person team, one month of data – labeled as an example, but the arithmetic is the arithmetic:

vd-report-april.md

Example – illustrative numbers
Team:       8 devs · ~60% of merges AI-assisted
Window:     4 weeks

Inputs
  merged changed lines:        41,200   (was ~19,000/mo pre-AI)
  reviewer hours (calendar):   64 h     (unchanged from pre-AI)
  substantive PR comments:     212
  AI-assisted merges:          97 · with recorded evidence: 31
  lines revised ≤14 days:      2,890

Metrics
  GVR:            41,200 / 64  = 644 lines/reviewer-hour
                  (pre-AI baseline: 297)  → 2.2x  ⚠
  Review depth:   212 / 412    = 0.51 per 100 lines  ⚠
  UMR:            (97-31)/97   = 68%                 ⚠⚠
  2-week churn:   2,890/41,200 = 7.0%                ⚠

Reading: generation doubled, verification capacity didn't.
Priority: record verification per change (UMR is the lever
that moves first when the process changes).
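
The warning marks in the report come from comparing each metric with its starting threshold. A small sketch of that check, assuming the thresholds from the table (GVR at least double the baseline, review depth below 1 per 100 lines, UMR above 30%, churn above 6%); the function name and the cut-off constants are illustrative and should be replaced by your own calibrated values:

```python
# Starting thresholds taken from the metrics table; calibrate locally.
GVR_GROWTH_LIMIT = 2.0      # GVR relative to the pre-AI baseline
MIN_REVIEW_DEPTH = 1.0      # substantive comments per 100 changed lines
MAX_UMR = 0.30              # share of AI-assisted merges without evidence
MAX_CHURN = 0.06            # share of lines revised within 14 days


def flag_metrics(gvr_now: float, gvr_baseline: float,
                 review_depth: float, umr: float, churn: float) -> list[str]:
    """Return the names of the metrics that cross their starting threshold."""
    checks = {
        "GVR": gvr_now / gvr_baseline >= GVR_GROWTH_LIMIT,
        "review depth": review_depth < MIN_REVIEW_DEPTH,
        "UMR": umr > MAX_UMR,
        "two-week churn": churn > MAX_CHURN,
    }
    return [name for name, crossed in checks.items() if crossed]


if __name__ == "__main__":
    # Values from the example report above.
    print(flag_metrics(644, 297, 0.51, 0.68, 0.070))
```

With the example report's values, all four metrics are flagged, which matches the warning marks in the report.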

Measurement pitfalls

  • Goodhart’s law. The moment a metric becomes a target, it stops measuring. These four are diagnostic instruments for the monthly retro – not OKRs, and never individual performance scores.
  • Attribution noise. Cleanly separating AI-assisted from human changes is hard; a simple PR label set by the author is imperfect but beats guessing. Consistency matters more than precision.
  • Small numbers. A three-person team’s monthly churn jumps around. Widen the window before drawing conclusions.
  • Measuring without changing anything. The metrics only pay off with a lever attached – written intent per task, a spec-vs-implementation check, recorded outcomes. Otherwise it is a dashboard of decline.
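
For the small-numbers pitfall, widening the window can be as simple as pooling the counts over the trailing months before dividing. A minimal sketch, assuming monthly pairs of (lines revised within 14 days, new lines); the function name, the three-month default, and the sample figures are hypothetical:

```python
from typing import List, Optional, Tuple


def pooled_churn(monthly: List[Tuple[int, int]],
                 window: int = 3) -> List[Optional[float]]:
    """Churn per month, pooled over the trailing `window` months.

    `monthly` holds (revised_lines, new_lines) pairs, oldest first.
    The first entries use fewer months because no earlier data exists.
    """
    pooled: List[Optional[float]] = []
    for i in range(len(monthly)):
        chunk = monthly[max(0, i - window + 1): i + 1]
        revised = sum(r for r, _ in chunk)
        new = sum(n for _, n in chunk)
        # Pool the counts first, then divide; avoid averaging noisy ratios.
        pooled.append(revised / new if new else None)
    return pooled


if __name__ == "__main__":
    # Hypothetical small-team data: the single-month ratios swing widely.
    months = [(40, 1000), (150, 1200), (30, 900), (90, 1100)]
    for (revised, new), value in zip(months, pooled_churn(months)):
        print(f"monthly {revised / new:.1%}  pooled {value:.1%}")
```

Pooling the raw counts weights each month by its volume, so one quiet month with a handful of revised lines does not dominate the reading.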

Where Reality Graph fits

The metric git cannot compute is whether a change was actually verified. Reality Graph produces that record as a by-product: every run ends in an evidence report, so the unverified-merge rate becomes a query instead of an archaeology project – and the other three metrics gain the denominator they need.

Measuring gives you

  • A trend line instead of a feeling that review is drowning
  • An argument budget-holders understand
  • Early warning before churn reaches production
  • A way to see whether process changes actually work

It does not

  • Come with industry-standard thresholds – calibrate locally
  • Attribute AI vs. human changes perfectly
  • Stay honest if used for individual performance reviews
  • Reduce the debt by itself – it points at the levers

FAQ

How do you measure verification debt concretely?
With four metrics computed from data you already have: the generation-to-verification ratio (merged changed lines vs. reviewer hours actually spent), review depth (substantive comments per 100 changed lines), the unverified-merge rate (share of AI-assisted merges without recorded validation evidence), and two-week churn (share of new code revised within 14 days). Track them monthly; the trend matters more than any single value.
Which metric should a team start with?
The unverified-merge rate, because it is the most direct expression of the debt: every merge without recorded verification is an unverified claim in production. It only requires one convention - that verification outcomes are recorded per change - and it immediately shows whether your process changes have traction.
Are there standard thresholds?
No standard exists yet - the field is too young. The thresholds in this article are starting points derived from published baselines, like GitClear's measured rise of two-week churn from about 3% to almost 6% across 211 million changed lines. Calibrate against your own three-month baseline rather than someone else's absolute number.
Do I need special tooling to measure this?
No. Git and your PR platform hold everything except verification records: changed lines, review comments, revision timestamps. A spreadsheet and a monthly retro are enough to start. Tooling helps with the one metric git cannot see - whether a change was actually verified - which is why evidence reports pay off twice.
Aren't DORA metrics enough?
DORA measures delivery performance, and AI lifts exactly those numbers - deployment frequency and lead time improve while the risk moves elsewhere. The 2025 DORA report added rework rate as a fifth metric for precisely this blind spot, and industry telemetry shows median PR review time up 441% as AI volume grows. Verification debt metrics complement DORA; they do not replace it.
Can these metrics be used to evaluate individual developers?
They should not be. Verification debt is a property of the system - task definitions, review capacity, tooling - not of individuals, and turning the metrics into personal scores invites gaming that destroys their diagnostic value. Measure the pipeline, discuss trends in the retro, and change the process rather than the ranking.
