Method

The Two-Pass Review Workflow

Last updated: 2026-07-02 · 4 min read

The two-pass review workflow splits review into a machine pass and a human pass: automated verification per change first – types, tests, boundaries, spec comparison, enforced in CI – then human review on the pre-verified diff, focused on architecture and business logic. Machines take the decidable questions; humans keep the judgment and the merge decision.

Why one pass stopped working

Review used to be one activity because one person could carry both jobs: checking correctness and judging quality. AI volume split those jobs apart. Telemetry across 10,000+ developers shows high-AI teams merging ~98% more PRs with review time up 91%, and 38% of developers finding AI code harder to review than a colleague’s. When the same scarce minutes must carry style nits, type errors, scope checks, and architecture judgment, the mechanical work crowds out the judgment – which is the part humans were actually needed for.

JetBrains’ engineering blog states the fix bluntly: stop sending machine-catchable errors to human review at all. The two-pass workflow is that principle, systematized.

Pass one: the machine pre-check

Pass one runs on every push, as a required status check – nothing reaches a human that failed it:

  1. Deterministic gates. Formatting, lint, types, build, and the test suite – including tests written before the run, so the generating model did not author its own judge.
  2. Boundary check. The diff compared against the declared task boundaries – files and behavior that must not change. Off-scope edits fail the pass.
  3. Spec comparison. The spec-vs-implementation check against the task’s acceptance criteria, with the outcome recorded.
  4. Optional: AI pre-review as hypothesis filter. An AI reviewer flags likely issues across the diff. Cloudflare’s orchestration shows this works at scale – as a screening layer whose output humans act on, never as the verdict.
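As an illustration of step 2, a boundary check can be sketched in a few lines. This is a minimal sketch, not a reference implementation: the function name `off_scope_files`, the glob-style boundary patterns, and the example paths are all assumptions made for this example, and it covers only the file half of the boundaries (behavior that must not change needs tests).

```python
from fnmatch import fnmatch


def off_scope_files(changed_files, allowed_patterns, protected_patterns=()):
    """Return the changed files that violate the declared task boundaries.

    A file is off-scope if it matches a protected pattern, or if it matches
    none of the allowed patterns. The pattern format is hypothetical.
    """
    violations = []
    for path in changed_files:
        is_protected = any(fnmatch(path, p) for p in protected_patterns)
        is_allowed = any(fnmatch(path, p) for p in allowed_patterns)
        if is_protected or not is_allowed:
            violations.append(path)
    return violations


# Hypothetical task: change billing code and its tests, leave auth untouched.
changed = [
    "src/billing/invoice.py",
    "src/auth/session.py",
    "tests/test_invoice.py",
]
violations = off_scope_files(
    changed,
    allowed_patterns=["src/billing/*", "tests/*"],
    protected_patterns=["src/auth/*"],
)
print(violations)  # ['src/auth/session.py']
```

In CI, a non-empty result would fail the pass, so the off-scope edit never reaches a reviewer.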

Pass two: the human review, upgraded

The reviewer receives the diff plus the pass-one results: what was checked, what passed, what was skipped. The job changes from reconstructing intent to exercising judgment – is this the right approach, does it fit the architecture, should this exist at all. The clean division:
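One way to picture the hand-off is a small record of pass-one results that is rendered for the reviewer. The sketch below is an assumption about shape only: `CheckResult`, `Status`, `summarize`, and `ready_for_human` are hypothetical names, and the check names are examples.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"
    SKIPPED = "skipped"


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: Status
    note: str = ""


def summarize(results):
    """Render one line per status: what passed, what failed, what was skipped."""
    lines = []
    for status in Status:
        names = [
            r.name + (f" ({r.note})" if r.note else "")
            for r in results
            if r.status is status
        ]
        lines.append(f"{status.value}: {', '.join(names) if names else 'none'}")
    return "\n".join(lines)


def ready_for_human(results):
    """A change reaches pass two only if no pass-one check failed."""
    return all(r.status is not Status.FAILED for r in results)


results = [
    CheckResult("types", Status.PASSED),
    CheckResult("tests", Status.PASSED),
    CheckResult("ai-pre-review", Status.SKIPPED, "optional"),
]
summary = summarize(results)
print(summary)
# passed: types, tests
# failed: none
# skipped: ai-pre-review (optional)
```

Listing skipped checks explicitly matters here: the reviewer should be able to tell a check that passed from one that never ran.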

Concern | Pass 1 – machine | Pass 2 – human
Formatting, lint, types, build | Enforced gate | Never sees it
Test results (incl. pre-written tests) | Enforced gate | Reads the summary
Scope vs. declared boundaries | Enforced gate | Judges intent of allowed changes
Acceptance criteria (spec comparison) | Checked + recorded | Spot-checks the record
Likely-bug hypotheses (AI review) | Screens + filters | Decides on flagged items
Architecture & system fit | – | Core job
Business logic sanity | – | Core job
Merge decision | Never | Always – with evidence in view

What belongs in the machine pass and what stays human: the dividing line is decidability – everything answerable yes/no goes to pass one, everything requiring judgment stays in pass two.

Limits and typical mistakes

  • Alibi automation. A green pass-one badge is an input to review, not a substitute for it. If approvals become reflexive, the workflow has failed quietly.
  • Noisy pre-review. An AI reviewer that cries wolf trains the team to ignore pass one entirely. Tune for precision over recall; deterministic gates carry the authority.
  • No written intent. Without a task specification, pass one shrinks to style and tests – real value, but the strongest checks (boundaries, criteria) need a written statement of the intended result.
  • Treating pass one as optional. A pre-check that can be skipped under deadline pressure will be – required status checks exist for a reason.
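The "precision over recall" advice for a noisy pre-review can be made concrete with a small calculation. This is a sketch under stated assumptions: it assumes the AI reviewer attaches a confidence score to each finding and that a human has later labeled each finding as real or not; the data and the name `precision_recall` are made up for illustration.

```python
def precision_recall(findings, threshold):
    """Compute precision and recall for findings surfaced at a threshold.

    `findings` is a list of (confidence, is_real) pairs, where `is_real`
    is a human triage label. Only findings at or above `threshold` are
    shown to the reviewer.
    """
    surfaced = [is_real for confidence, is_real in findings if confidence >= threshold]
    total_real = sum(1 for _, is_real in findings if is_real)
    true_positives = sum(surfaced)
    precision = true_positives / len(surfaced) if surfaced else 1.0
    recall = true_positives / total_real if total_real else 1.0
    return precision, recall


# Hypothetical triage data: (reviewer confidence, was it a real issue?)
findings = [(0.9, True), (0.8, True), (0.4, False), (0.3, False), (0.2, True)]

strict = precision_recall(findings, threshold=0.5)  # fewer flags, all real
loose = precision_recall(findings, threshold=0.1)   # every flag, some noise
print(strict, loose)
```

In this made-up data the strict threshold surfaces two findings, both real, and misses one real issue; the loose threshold catches everything but two of its five flags are noise. Tuning for precision means accepting the first trade so reviewers keep trusting the flags.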

Where Reality Graph fits

Reality Graph is a structured pass one with intent: task boundaries and acceptance criteria defined before the run, the diff checked against them after it, validation the model did not author, and an evidence report as the hand-off artifact into pass two – the missing piece between generic CI checks and the human reviewer, inside the broader verification loop.

The two-pass workflow gives you

  • Human review minutes spent on judgment, not mechanics
  • A consistent machine gate on every single change
  • Reviewers who read pre-verified diffs with evidence attached
  • Scale: pass one grows with CI, not with headcount

It does not

  • Replace the human merge decision – ever
  • Work as an alibi – green badges are inputs, not verdicts
  • Reach full strength without written task intent
  • Require any specific AI reviewer or vendor

FAQ

What does an efficient review workflow for AI code look like?
Two passes with different jobs: pass one is machine verification per change - formatting, types, tests, boundary and spec checks, optionally an AI pre-review as a hypothesis filter - enforced as a required status check. Pass two is human review on the pre-verified diff, focused on architecture, business logic, and the questions no checklist can ask. The human always makes the merge decision.
What belongs in the machine pass and what stays human?
Everything decidable belongs to the machine: style, types, test results, scope against declared boundaries, criteria from the task specification. Everything judgmental stays human: is this the right approach, does it fit the system, is the intent itself sensible. The dividing line is decidability, not difficulty.
Is an AI reviewer the same as a machine pre-check?
It is one ingredient, not the pass itself. An AI reviewer generates hypotheses about problems - useful for clearing mechanical noise, but it does not know what the change was supposed to do. The pre-check earns its authority from deterministic gates (tests, types, boundaries, spec comparison); the AI reviewer adds breadth on top, filtered by a human-set signal threshold.
Doesn't a required pre-check just slow merging down?
It moves waiting from humans to machines. The pre-check runs in CI minutes on every push; the human reviewer - the scarce resource - receives changes that already passed the mechanical layer. Cloudflare's engineering team describes exactly these economics at scale: orchestrated automated review so human attention lands where it changes the outcome.
What is the most common way this workflow fails?
Alibi automation: the pre-check exists, so humans stop reading - or the pre-check is so noisy that everyone learns to ignore it. Both are calibration failures. The pass-one signal must stay high-precision (JetBrains' advice applies: don't send IDE-catchable errors to review at all), and pass two must stay a genuine review, not a rubber stamp on a green checkmark.
Do we need a task specification for this to work?
The workflow improves review without one, but its strongest check - comparing the change against declared intent and boundaries - needs a written task. Five minutes of specification per run unlocks the difference between 'the tests pass' and 'this is verifiably the change we asked for'.
