Method
Measuring Verification Debt
Last updated: 2026-07-02 · 4 min read
You measure verification debt as the gap between how much code a team merges and how much of it is demonstrably verified – tracked with four computable metrics: generation-to-verification ratio, review depth, unverified-merge rate, and two-week churn. All four come from git and PR data; the trend over months matters more than any single reading.
Why your current dashboards don't show the debt
Delivery metrics improve while the debt grows – that is what makes it invisible. AI adoption lifts deployment frequency and lead time, and the risk moves into dimensions the classic dashboards were never built to watch: the 2025 DORA report added rework rate as a fifth metric for exactly this blind spot, with industry telemetry showing median PR review time up 441% and PR size up 51% as AI volume grows.
This article turns the five warning signals from the verification debt overview into something you can put in a spreadsheet this week: formulas, data sources, and honest starting thresholds.
The four metrics, with formulas
| Metric | Formula | Data source | Warning threshold (starting point) |
|---|---|---|---|
| Generation-to-verification ratio (GVR) | Merged changed lines per week ÷ reviewer hours actually spent | git log --stat; calendar or PR timestamps | Ratio doubles while reviewer hours stay flat |
| Review depth | Substantive review comments ÷ (changed lines ÷ 100) | PR platform API (exclude bot and nitpick comments) | Falling below ~1 comment per 100 lines while AI volume rises |
| Unverified-merge rate (UMR) | AI-assisted merges without recorded validation evidence ÷ all AI-assisted merges | Verification records / evidence reports per change | Above 30% – roughly half of Sonar's respondents don't always verify AI code |
| Two-week churn | Lines revised or reverted ≤ 14 days after merge ÷ new lines | git log with follow-up commit analysis | Above ~6% – GitClear measured the industry drifting from 3.1% to 5.7% |
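
The four formulas in the table can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `WindowInputs` class, its field names, and the `verification_debt_metrics` function are hypothetical, the counts are assumed to be collected already, and churn uses merged changed lines as the denominator.

```python
from dataclasses import dataclass


@dataclass
class WindowInputs:
    """Raw counts for one measurement window (hypothetical field names)."""
    merged_changed_lines: int
    reviewer_hours: float
    substantive_comments: int
    ai_merges: int
    ai_merges_with_evidence: int
    lines_revised_within_14_days: int


def verification_debt_metrics(w: WindowInputs) -> dict[str, float]:
    """Apply the four table formulas to one window; denominators must be non-zero."""
    return {
        # Merged changed lines per reviewer hour actually spent
        "gvr": w.merged_changed_lines / w.reviewer_hours,
        # Substantive review comments per 100 changed lines
        "review_depth": w.substantive_comments / (w.merged_changed_lines / 100),
        # Share of AI-assisted merges without recorded validation evidence
        "umr": (w.ai_merges - w.ai_merges_with_evidence) / w.ai_merges,
        # Share of lines revised or reverted within 14 days of merge
        "two_week_churn": w.lines_revised_within_14_days / w.merged_changed_lines,
    }
```

UMR and churn come back as fractions, so multiply by 100 to compare them with the percentage thresholds in the table.
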
Two published anchors help you place your numbers. GitClear’s analysis of 211 million changed lines shows two-week churn rising from 3.1% (2020) toward 5.7%, with copy-pasted lines overtaking refactored lines for the first time. Sonar’s survey puts the share of developers who always verify AI code at 48%, which makes a naive team-level UMR of ~50% a realistic, sobering default.
A worked example
A fictional eight-person team, one month of data – labeled as an example, but the arithmetic is the arithmetic:
vd-report-april.md
Example – illustrative numbers
Team: 8 devs · ~60% of merges AI-assisted
Window: 4 weeks
Inputs
merged changed lines: 41,200 (was ~19,000/mo pre-AI)
reviewer hours (calendar): 64 h (unchanged pre-AI)
substantive PR comments: 212
AI-assisted merges: 97 · with recorded evidence: 31
lines revised ≤14 days: 2,890
Metrics
GVR: 41,200 / 64 = 644 lines/reviewer-hour
(pre-AI baseline: 297) → 2.2x ⚠
Review depth: 212 / 412 = 0.51 per 100 lines ⚠
UMR: (97-31)/97 = 68% ⚠⚠
2-week churn: 2,890/41,200 = 7.0% ⚠
Reading: generation doubled, verification capacity didn't.
Priority: record verification per change (UMR is the lever
that moves first when the process changes).
Measurement pitfalls
- Goodhart’s law. The moment a metric becomes a target, it stops measuring. These four are diagnostic instruments for the monthly retro – not OKRs, and never individual performance scores.
- Attribution noise. Cleanly separating AI-assisted from human changes is hard; a simple PR label set by the author is imperfect but beats guessing. Consistency matters more than precision.
- Small numbers. A three-person team’s monthly churn jumps around. Widen the window before drawing conclusions.
- Measuring without changing anything. The metrics only pay off with a lever attached – written intent per task, a spec-vs-implementation check, recorded outcomes. Otherwise it is a dashboard of decline.
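
One way to widen the window for a small team, sketched under assumptions: pool the raw counts over several months before dividing, instead of averaging monthly percentages. The `pooled_rate` helper and the monthly figures below are made up for illustration.

```python
def pooled_rate(windows: list[tuple[int, int]], last_n: int) -> float:
    """Rate over the trailing `last_n` windows, pooling counts before dividing.

    Each window is (numerator, denominator), for example
    (lines revised within 14 days, merged changed lines) for one month.
    """
    if last_n < 1:
        raise ValueError("last_n must be at least 1")
    recent = windows[-last_n:]
    numerator = sum(n for n, _ in recent)
    denominator = sum(d for _, d in recent)
    if denominator == 0:
        raise ValueError("no changed lines in the selected windows")
    return numerator / denominator


# Made-up monthly churn counts for a small team: (revised lines, merged lines)
monthly = [(40, 1_500), (210, 2_100), (55, 1_800)]
single_month = pooled_rate(monthly, last_n=1)  # last month only
three_months = pooled_rate(monthly, last_n=3)  # pooled over the quarter
```

Pooling weights each month by its volume, so one quiet month with a handful of reverted lines cannot swing the reading on its own.
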
Where Reality Graph fits
The metric git cannot compute is whether a change was actually verified. Reality Graph produces that record as a by-product: every run ends in an evidence report, so the unverified-merge rate becomes a query instead of an archaeology project – and the other three metrics gain the denominator they need.
Measuring gives you
- A trend line instead of a feeling that review is drowning
- An argument budget-holders understand
- Early warning before churn reaches production
- A way to see whether process changes actually work
It does not
- Come with industry-standard thresholds – calibrate locally
- Attribute AI vs. human changes perfectly
- Stay honest if used for individual performance reviews
- Reduce the debt by itself – it points at the levers
FAQ
- How do you measure verification debt concretely?
- With four metrics computed from data you already have: the generation-to-verification ratio (merged changed lines vs. reviewer hours actually spent), review depth (substantive comments per 100 changed lines), the unverified-merge rate (share of AI-assisted merges without recorded validation evidence), and two-week churn (share of new code revised within 14 days). Track them monthly; the trend matters more than any single value.
- Which metric should a team start with?
- The unverified-merge rate, because it is the most direct expression of the debt: every merge without recorded verification is an unverified claim in production. It only requires one convention - that verification outcomes are recorded per change - and it immediately shows whether your process changes have traction.
- Are there standard thresholds?
- No standard exists yet - the field is too young. The thresholds in this article are starting points derived from published baselines, like GitClear's measured rise of two-week churn from about 3% to almost 6% across 211 million changed lines. Calibrate against your own three-month baseline rather than someone else's absolute number.
- Do I need special tooling to measure this?
- No. Git and your PR platform hold everything except verification records: changed lines, review comments, revision timestamps. A spreadsheet and a monthly retro are enough to start. Tooling helps with the one metric git cannot see - whether a change was actually verified - which is why evidence reports pay off twice.
- Aren't DORA metrics enough?
- DORA measures delivery performance, and AI lifts exactly those numbers - deployment frequency and lead time improve while the risk moves elsewhere. The 2025 DORA report added rework rate as a fifth metric for precisely this blind spot, and industry telemetry shows median PR review time up 441% as AI volume grows. Verification debt metrics complement DORA; they do not replace it.
- Can these metrics be used to evaluate individual developers?
- They should not be. Verification debt is a property of the system - task definitions, review capacity, tooling - not of individuals, and turning the metrics into personal scores invites gaming that destroys their diagnostic value. Measure the pipeline, discuss trends in the retro, and change the process rather than the ranking.
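
The FAQ's advice to calibrate against your own three-month baseline could look like the sketch below. The `change_vs_baseline` function is hypothetical, and it assumes one reading per month, oldest first, with a non-zero baseline.

```python
from statistics import mean


def change_vs_baseline(history: list[float], current: float, months: int = 3) -> float:
    """Relative change of `current` against the mean of the last `months` readings."""
    if len(history) < months:
        raise ValueError(f"need at least {months} earlier readings for a baseline")
    baseline = mean(history[-months:])
    return (current - baseline) / baseline


# Made-up UMR readings for the three previous months, then the current month
umr_change = change_vs_baseline([0.40, 0.50, 0.60], current=0.68)
```

A positive result means the metric sits above your own recent baseline. Whether that is good or bad depends on the metric: rising review depth is welcome, rising UMR is not.
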
Sources
- GitClear – AI Copilot Code Quality: 211M changed lines, churn and duplication trends 2020–2024 (2025)
- Faros AI – Key takeaways from the DORA Report 2025: rework rate, review-time inflation (2025)
- Sonar – State of Code Developer Survey: the 96/48 verification gap (2026)
- Faros AI telemetry (10,000+ developers): ~98% more merged PRs, review time +91% – summarized in 'The State of AI Code Review in 2026' (2026)