Rebuilding Trust in AI Pull Requests

Last updated: 2026-07-02 · 4 min read

Trust in AI pull requests is not something you decide to have – it is the output of a process that earns it per change. Merging with a clear conscience means five checkable facts replaced “looks fine”: a written task, a scope check against declared boundaries, validation the model did not author, named skips, and a human who read the record. Gut feel fails on AI code because every signal it was trained on is missing.

How trust erodes - the measured pattern

The erosion is not hypothetical: in Sonar's data, 96% of developers distrust AI code while only 48% consistently verify it. That combination - high distrust, low verification - is the signature of trust without a workflow. Volume explains it: with nearly twice the merged PRs and review time up 91%, blanket suspicion cannot afford itself, so it collapses into its opposite - approvals on vibes, doubts kept private. The first incident then converts private doubt into open blanket distrust, which is just as uncalibrated in the other direction. Neither state can tell a good AI PR from a bad one.

Why gut feel fails on AI code specifically

No author model. Human-code review leans on knowing who wrote it - their strengths, their blind spots, whether they test. An AI PR has no stable author to model; yesterday’s flawless run says little about today’s.
Plausibility stops predicting correctness. AI code is confidently idiomatic whether right or wrong - the characteristic failure classes (plausible-but-wrong logic, self-confirming tests) are precisely the ones that read well.
Perception itself is skewed. In METR’s randomized trial, experienced developers felt faster with AI while being 19% slower - the same misjudgment that makes an unverified PR feel safe to merge.

The clear-conscience checklist

#	Question	Answered by
1	What was this change supposed to do?	The written task attached to the PR
2	Did it stay inside its mandate?	Scope check: diff vs. declared boundaries
3	Did independent validation pass?	Results of checks the model did not author
4	What was skipped or left uncertain?	Named skips and open points in the record
5	Do I judge the trade-offs acceptable?	The human decision - the only non-delegable row

The five questions that replace gut feel at the merge decision - each answerable in seconds when the PR arrives with its evidence attached.
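
Row 2 is the most mechanical of the five, so it is worth showing how little it takes. The sketch below is a minimal illustration in Python, assuming the task declares its boundaries as glob patterns and the changed paths come from something like `git diff --name-only`. The function name `out_of_scope`, the patterns and the file paths are all hypothetical, not part of any tool named in this article.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_paths, allowed_patterns):
    """Return the changed paths that match none of the declared boundaries."""
    return [
        path
        for path in changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical boundaries, written into the task before the agent run.
allowed = ["src/billing/*", "tests/billing/*"]

# Hypothetical diff, e.g. the output of `git diff --name-only main...HEAD`.
changed = [
    "src/billing/invoice.py",
    "tests/billing/test_invoice.py",
    "src/auth/session.py",
]

violations = out_of_scope(changed, allowed)
print(violations)  # an empty list means the diff stayed inside its mandate
```

A non-empty result does not have to block the merge; it turns a silent scope creep into a named item the reviewer can accept or reject in row 5.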

Four of the five rows are facts a record answers; only the fifth is judgment. That division is the entire relief: the reviewer’s hour stops going into reconstructing rows one to four (the verification asymmetry) and goes into row five, which is what senior attention was for. The record format behind rows one to four is the evidence report.
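
The split between fact rows and the judgment row can be sketched as a data structure. This is one possible shape under stated assumptions, not the format of any specific evidence report: the class `EvidenceRecord`, its field names and the helper `open_fact_rows` are hypothetical. The helper only reports which of rows one to four are unanswered or failing; it deliberately returns no verdict, because row five stays with the reviewer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceRecord:
    """Hypothetical per-change record covering checklist rows 1 to 4."""
    task: str                      # row 1: the written task
    out_of_scope_paths: list       # row 2: diff entries outside declared boundaries
    independent_checks: dict       # row 3: check name -> passed (True/False)
    skips: Optional[list] = None   # row 4: None = never stated, [] = nothing skipped


def open_fact_rows(record):
    """List the fact rows the record leaves unanswered or failing."""
    problems = []
    if not record.task.strip():
        problems.append("row 1: no written task attached")
    if record.out_of_scope_paths:
        problems.append(f"row 2: outside boundaries: {record.out_of_scope_paths}")
    if not record.independent_checks:
        problems.append("row 3: no independent validation recorded")
    failed = sorted(name for name, ok in record.independent_checks.items() if not ok)
    if failed:
        problems.append(f"row 3: failed checks: {failed}")
    if record.skips is None:
        problems.append("row 4: skips were never stated")
    return problems


# Hypothetical record attached to a PR.
record = EvidenceRecord(
    task="Round invoice totals to two decimals",
    out_of_scope_paths=[],
    independent_checks={"ci-unit-tests": True, "lint": True},
    skips=["integration tests: staging unavailable"],
)
problems = open_fact_rows(record)
print(problems or "rows 1-4 answered; row 5 is the reviewer's call")
```

Distinguishing `None` from an empty list for `skips` is a design choice in this sketch: a record that says "nothing was skipped" is a claim someone can be held to, while a record that says nothing about skips leaves row 4 open.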

The team dynamics: trust becomes transferable

Personal trust is a pairwise asset - it lives between two people, takes months to build, resets with every new hire and shatters on the first incident. Evidence-based trust is a property of the change: any reviewer reaches the same conclusion from the same record, the new team member merges with the same confidence as the veteran, and the first incident triggers a lookup instead of a witch hunt - which change, which checks, what was skipped. Two norms keep the system honest: rubber-stamping is treated as the capacity signal it is, never as an accusation - and rework data like rising churn is read as a workflow finding, not a person finding.
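
What "a lookup instead of a witch hunt" could look like in practice is sketched below, assuming the per-change records are stored keyed by commit id. The store layout, the commit ids and the function `incident_lookup` are hypothetical; a real setup would find the offending commit with ordinary tooling first and then read its record.

```python
# Hypothetical store: one record per merged change, keyed by commit id.
records = {
    "a1b2c3d": {
        "task": "Round invoice totals to two decimals",
        "checks": {"ci-unit-tests": "passed", "lint": "passed"},
        "skips": ["integration tests: staging unavailable"],
    },
}


def incident_lookup(commit_id, store):
    """Answer 'which change, which checks, what was skipped' from the record."""
    record = store.get(commit_id)
    if record is None:
        # A missing record is itself a finding about the workflow.
        return f"{commit_id}: no record, reconstruct manually"
    checks = ", ".join(f"{name}={result}" for name, result in record["checks"].items())
    skips = "; ".join(record["skips"]) or "none named"
    return f"{commit_id}: {record['task']} | checks: {checks} | skipped: {skips}"


print(incident_lookup("a1b2c3d", records))
print(incident_lookup("fffffff", records))
```

The answer names a change and its gaps, not a person, which is what keeps the post-incident conversation about the workflow.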

Where Reality Graph fits

Reality Graph produces the four fact rows of the checklist as a byproduct of each run: the written task, the boundary check, independent validation results and named skips arrive with the change as an evidence report, local-first. Row five stays yours - the tool exists to make the human decision cheap and informed, not to make it.

Evidence-based trust gives you

A merge decision resting on five checkable facts
Trust that transfers across people, tools and time
Incidents that trigger lookups instead of blame
Reviewer hours moved from reconstruction to judgment

It does not give you

A reason to stop reading code - judgment still reads
Protection from bad architecture a record cannot show
Instant culture change - the norm needs a few sprints
A verdict machine - row five is always a human

FAQ

How do you merge AI pull requests with a clear conscience?

By making the conscience rest on evidence instead of impressions: the PR carries its written task, the diff stayed within declared boundaries, validation the model did not author passed, skips are named, and a human read that record before deciding. Five checkable facts replace 'it looks fine' - and each of them takes seconds to confirm when the evidence arrives attached.

Why does gut feel fail specifically on AI pull requests?

Because every calibration signal reviewers rely on is missing or misleading. There is no author whose strengths you know; the code is confidently idiomatic whether right or wrong, so plausibility stops predicting correctness; and perceived quality diverges from real quality - METR's trial found experienced developers feeling faster with AI while being 19% slower. Gut feel was trained on human code; AI code breaks its assumptions.

Isn't distrusting every AI PR the safer default?

Blanket distrust without a workflow produces the worst of both worlds: reviews get slower while verification does not get better, and under deadline pressure the distrust silently collapses into rubber-stamping anyway - Sonar's data shows the pattern at scale, with 96% distrusting and only 48% consistently verifying. Trust is not the input you choose; it is the output of a process that earns it per change.

Is rubber-stamping a discipline problem?

No - and treating it as one is why appeals fail. Rubber-stamping is what any review culture produces when volume outruns capacity: approvals keep flowing because blocking everything is not an option, and really reading everything is not either. The fix is mechanical, not moral: shrink what needs human judgment (machine pre-checks), and give the judgment a cheap starting point (evidence). Capacity problems need capacity solutions.

What changes for the PR author - human or agent operator?

The deliverable grows by one artifact: the change plus its record. For a developer running an agent, that means writing the task before the run and attaching what was verified after it - minutes of work that replace the reviewer's hour of reconstruction. Teams that adopt the norm report a side effect: authors catch their own scope creep and skipped checks before requesting review, because the record makes them visible to the author first.

Does evidence-based trust scale to teams and audits?

That is its main advantage over personal trust: it is transferable. Personal trust lives in pairs of people and resets with every new hire, new tool or first incident. A record per change means any reviewer - and later any auditor - can reach the same conclusion from the same facts. The per-change records accumulate into exactly the audit trail regulated environments ask for.
