What Verification Debt Costs

Last updated: 2026-07-02 (5 min read)

Verification debt costs real money – it just bills late and under other names: rework on churned code, reviewer hours spent reconstructing intent, and an incident allowance. In our example calculation for a 12-person team – built on sourced inputs, with every assumption labeled and swappable – the annual bill lands in the range of one to two engineer salaries. The number you should trust is the one you get after inserting your own values, which takes ten minutes.

The sourced inputs the model stands on

Three measured findings anchor the arithmetic. GitClear’s 211-million-line analysis shows two-week churn drifting from a ~3.1% baseline toward 5.7% as AI assistance grows – the delta is the debt-driven share we price. Faros telemetry shows review time per PR up 91% at nearly double the PR volume – the reconstruction tax. And Veracode’s ~45% security-failure rate feeds a deliberately modest incident allowance. Everything else in the model is an assumption you should replace.

The example calculation, line by line

| Line | Value in the example | Class |
|---|---|---|
| AI-assisted changes merged | ~120 PRs/month | Assumption - use your PR data |
| Churn delta priced as debt | 2.5 points (5.6% vs. 3.1% baseline) | Measured trend (GitClear) applied to example volume |
| Hours per reworked change | 6 h average | Assumption - between trivial revert and deep fix |
| Review reconstruction per AI PR | 0.5 h average | Assumption anchored in +91% review-time telemetry |
| Loaded cost per engineer hour | €75 | Assumption - use your finance number |
| Incident allowance | €20,000/year | Assumption flagged as widest error bar |

Example calculation for a 12-person team - the 'Class' column is the honesty mechanism: measured inputs carry sources, assumptions are yours to replace (example, not a benchmark; status July 2026).

The arithmetic, transparently: 120 PRs × 2.5% churn delta ≈ 3 reworked changes per month × 6 h × €75 ≈ €1,350/month in rework. Review reconstruction: 120 PRs × 0.5 h × €75 ≈ €4,500/month – note that this line dwarfs rework, because it is paid on every change, not only the defective ones. Add the incident allowance and the year lands around €90,000–€110,000 – one to two loaded engineer salaries, for a team that ships fast and reads little. Swap any assumption and the model recomputes; the structure is the point, not the point value.
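The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `DebtInputs` and `annual_cost` are hypothetical, the defaults are the example values from the table, and every field is meant to be overwritten with your own data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtInputs:
    """Example inputs from the table; replace every field with your own numbers."""
    prs_per_month: float = 120.0              # AI-assisted changes merged (assumption)
    churn_delta: float = 0.025                # churn above baseline, 2.5 points
    rework_hours: float = 6.0                 # hours per reworked change (assumption)
    reconstruction_hours: float = 0.5         # review reconstruction per AI PR (assumption)
    hourly_cost_eur: float = 75.0             # loaded cost per engineer hour (assumption)
    incident_allowance_eur: float = 20_000.0  # per year, the widest error bar


def annual_cost(inputs: DebtInputs) -> dict[str, float]:
    """Return the three line items and their total, in euros per year."""
    reworked_changes = inputs.prs_per_month * inputs.churn_delta
    rework = reworked_changes * inputs.rework_hours * inputs.hourly_cost_eur * 12
    reconstruction = (
        inputs.prs_per_month * inputs.reconstruction_hours * inputs.hourly_cost_eur * 12
    )
    total = rework + reconstruction + inputs.incident_allowance_eur
    return {
        "rework": rework,
        "reconstruction": reconstruction,
        "incidents": inputs.incident_allowance_eur,
        "total": total,
    }


if __name__ == "__main__":
    for line, euros in annual_cost(DebtInputs()).items():
        print(f"{line:>15}: €{euros:,.0f}")
```

With the table's values unchanged, this yields about €16,200 in rework, €54,000 in review reconstruction and €20,000 in incident allowance, for roughly €90,200 per year, the low end of the range quoted above. To run it with your own numbers, pass them as keyword arguments, for example `annual_cost(DebtInputs(prs_per_month=200, hourly_cost_eur=90))`.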

Sensitivity: what flips the result

Halve the churn delta (your codebase behaves better than the GitClear trend): the rework line halves, which removes about €8,100 per year in the example, but the total drops by under 10% – the reconstruction line keeps the bill high.
Zero the reconstruction line (every PR already arrives with a written task and evidence): about €54,000 per year disappears in the example, so the total falls by roughly half or more. This is the single most sensitive lever, and the cheapest to pull.
Drop the incident allowance (prototype work, no production exposure): the bill shrinks but stays five figures – and this is also the honest case where verification effort deserves scaling down, as the vibe-coding analysis concedes.

What the sensitivity does not do is flip the sign: under every defensible parameterization, unverified AI volume costs more than the structure that verifies it. Measuring your own inputs is a solved problem – the four metrics compute from git and PR data you already have.
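The three scenarios in this section can be recomputed mechanically. The sketch below is one hedged way to do it: `DebtInputs`, `annual_total` and `sensitivity` are hypothetical names, the defaults are the example values from the table, and the scenario list only mirrors the three cases named above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DebtInputs:
    """Example inputs from the table; all of them are meant to be replaced."""
    prs_per_month: float = 120.0
    churn_delta: float = 0.025                # 2.5 points above baseline
    rework_hours: float = 6.0
    reconstruction_hours: float = 0.5
    hourly_cost_eur: float = 75.0
    incident_allowance_eur: float = 20_000.0


def annual_total(i: DebtInputs) -> float:
    """Rework + review reconstruction + incident allowance, in euros per year."""
    rework = i.prs_per_month * i.churn_delta * i.rework_hours * i.hourly_cost_eur * 12
    reconstruction = i.prs_per_month * i.reconstruction_hours * i.hourly_cost_eur * 12
    return rework + reconstruction + i.incident_allowance_eur


def sensitivity(base: DebtInputs) -> dict[str, tuple[float, float]]:
    """Map each scenario to (new annual total, percent change vs. the base case)."""
    scenarios = {
        "halve churn delta": replace(base, churn_delta=base.churn_delta / 2),
        "zero reconstruction": replace(base, reconstruction_hours=0.0),
        "drop incident allowance": replace(base, incident_allowance_eur=0.0),
    }
    base_total = annual_total(base)
    results = {}
    for name, inputs in scenarios.items():
        total = annual_total(inputs)
        results[name] = (total, (total - base_total) / base_total * 100)
    return results


if __name__ == "__main__":
    for name, (total, change) in sensitivity(DebtInputs()).items():
        print(f"{name:>25}: €{total:,.0f} ({change:+.1f}%)")
```

Changing one field at a time keeps each scenario attributable to a single assumption. Combined scenarios are easy to add by passing several fields to `replace`, but they are harder to interpret.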

Where the money actually goes - and comes back

The bill above is not an argument against AI coding – the same telemetry that shows the debt shows real throughput gains. It is the price of the unverified variant specifically: the rework line is churn above baseline, the reconstruction line is reviews without a reference, the allowance is defects that reached production. Each line maps to a practice that removes it – written tasks, verification before merge, evidence attached – and the question of when those practices pay for themselves is the ROI calculation, run with the same transparency.

Where Reality Graph fits

Reality Graph attacks the two largest lines of the example: written tasks with verification per run collapse the reconstruction tax, and pre-merge checks move defects from post-merge rework to pre-merge fixes. We make no savings promises and quote no percentages – the honest move is the one this article makes: run the model with your numbers, before and after a pilot, and let your own data speak. The evidence reports make that before/after measurable.

This calculation gives you

A transparent model with sourced anchors and labeled assumptions
The line items debt actually bills under
Sensitivity analysis instead of a cherry-picked point value
A ten-minute path to your own number

It does not give you

A universal cost figure - the example is an example
An argument against AI coding - the gains are real too
Savings promises for any tool, including Reality Graph
Precision on incidents - that error bar is honestly wide

FAQ

What does unverified AI code cost a team per year?

There is no honest universal number - the cost depends on volume, rates and incident exposure - but there is honest arithmetic. In our example calculation for a 12-person team (assumptions labeled and swappable), the annual bill lands in the range of one to two full-time engineer salaries, dominated by rework on churned code and review time spent reconstructing intent. The point of the model is not the point value; it is that you can insert your own numbers in ten minutes.

Which of the inputs are measured, and which are assumptions?

Measured, with sources: two-week churn rising toward 5.7% in AI-heavy codebases (GitClear, 211M lines), review time per PR up 91% in high-AI teams (Faros telemetry), and roughly 45% of AI-generated samples failing security tests (Veracode 2025). Assumptions, clearly labeled: your team's output volume, loaded cost per engineer hour, hours per reworked change, and incident frequency. The article's table separates the two classes explicitly.

Isn't churn partly normal and even healthy?

Yes - some churn is fast iteration doing its job, and an honest model does not price all churn as waste. That is why the example prices only the churn delta above the pre-AI baseline (GitClear's ~3.1%) as debt-driven, and why the sensitivity section shows the result under friendlier assumptions. Even the conservative variant stays expensive, which is the finding that matters.

What is the single most expensive line item?

In most parameterizations: review reconstruction - the time a reviewer spends reverse-engineering what a change was supposed to do (0.5 h per PR in the example), multiplied across every AI-assisted PR. It outweighs rework in teams with high PR volume because it is paid on every change, not just the defective ones. It is also the cheapest line to attack, since a written task per run removes most of it.

Does the calculation include incidents?

As a separate, explicitly uncertain line - incident costs are lumpy and dominated by rare events, so folding them smoothly into an annual average would fake precision. The example carries a modest allowance based on the security-failure rates and flags it as the widest error bar. Teams in regulated or high-exposure domains should model this line with their own incident data, not ours.

What would change the math most?

Two levers, both upstream: written tasks per AI run (collapses the review-reconstruction line and part of the churn delta) and independent validation before merge (moves defects from post-merge rework to pre-merge fixes, which cost a fraction). That is the bridge to the ROI question - when the cost of running those practices undercuts the debt they remove - which the ROI article walks through with the same transparency.
