Concept

AI Coding Evidence Reports

Last updated: 2026-07-02 · 3 min read

An AI coding evidence report records, per run: what was intended, what changed, what was validated and with what result, what was deliberately skipped, and what remains uncertain – stored with the code, readable by the reviewer, the team, and the future.

Why 'it works' stopped being enough

When a person writes a change, the review conversation happens with someone who remembers why. When an AI writes it, that memory does not exist – the prompt is gone, the reasoning is gone, and the summary was written by the same model whose work is in question. Sonar found that 53% of developers have seen AI code that looks correct but isn’t reliable; “looks correct” is precisely the problem evidence exists to solve.

Evidence replaces recollection with a record. Not a heavyweight document – a short, structured answer to the five questions every reviewer silently asks.

The five questions a report answers

Intent – what was this run supposed to do, in writing, from before it started?
Change – what actually changed, and did it stay inside the declared boundaries?
Validation – which checks ran (tests, types, lint, build, targeted checks), with what results?
Negative space – what was deliberately not validated, and why? This is the part no self-written summary volunteers.
Decision – who accepted it, knowing all of the above?
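
The five questions map naturally onto a small data structure. The following is a minimal sketch in Python, not a fixed schema: the class and field names (`EvidenceReport`, `Check`, `Status`, `negative_space`) and the two file names are hypothetical, and the values are taken from the illustrative sample in this article.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"
    SKIPPED = "skipped"      # deliberately not validated
    UNCERTAIN = "uncertain"  # result needs a human look


@dataclass
class Check:
    name: str
    status: Status
    note: str = ""  # why it was skipped, or what remains uncertain


@dataclass
class EvidenceReport:
    task: str                 # Intent: written down before the run
    boundaries: list[str]     # Intent: where changes are allowed
    changed_files: list[str]  # Change: what actually changed
    within_boundaries: bool   # Change: did it stay inside the boundaries?
    checks: list[Check] = field(default_factory=list)  # Validation
    decision: str = ""        # Decision: who accepted it, on what terms

    def negative_space(self) -> list[Check]:
        """Return the checks that were skipped or remain uncertain."""
        open_states = (Status.SKIPPED, Status.UNCERTAIN)
        return [c for c in self.checks if c.status in open_states]


# Illustrative values only; the file names are made up for this sketch.
report = EvidenceReport(
    task="Return correct Retry-After header on 429 responses",
    boundaries=["api/middleware/*"],
    changed_files=["api/middleware/headers.py", "api/middleware/errors.py"],
    within_boundaries=True,
    checks=[
        Check("unit tests", Status.PASSED),
        Check("load test", Status.SKIPPED, "staging env unavailable"),
        Check("header behavior behind CDN", Status.UNCERTAIN, "needs manual check"),
    ],
    decision="APPROVED by mk · load test to follow before release",
)
```

Making the negative space a query on the report, rather than free text, is one way to keep skipped and uncertain items from quietly disappearing.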

evidence-report.md

Sample – illustrative, not real run data

Run:        2026-07-02 · fix-rate-limit-headers
Task:       Return correct Retry-After header on 429 responses
Boundaries: api/middleware/* only · no changes to rate-limit logic
Tool:       Claude Code

Changes:    2 files, +38 −7   (within boundaries ✓)

Validation
  ✓ unit tests (61 passing, 3 new – written before the run)
  ✓ type check, lint, build
  ✗ load test           – SKIPPED: staging env unavailable
  ? header behavior behind CDN – UNCERTAIN, needs manual check

Decision:   APPROVED by mk · load test to follow before release

The format matters less than the discipline: anchored to a pre-written task, honest about gaps, stored where the code lives – not in a chat scrollback.
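
Since the format matters less than the discipline, rendering can stay small. Here is a rough sketch of a plain-text renderer that produces a layout similar to the sample; the function names, the `(name, status, note)` tuple shape, and the use of `✗` for failed checks are assumptions made for illustration.

```python
# Markers mirror the sample report; "failed" sharing the ✗ marker is an assumption.
MARKERS = {"passed": "✓", "failed": "✗", "skipped": "✗", "uncertain": "?"}


def render_check(name, status, note=""):
    """Render one validation line, keeping the reason for any gap visible."""
    marker = MARKERS[status]
    if status == "passed":
        return f"  {marker} {name}" + (f" ({note})" if note else "")
    return f"  {marker} {name} – {status.upper()}: {note}"


def render_report(run, task, boundaries, checks, decision):
    """Assemble the report text from already-collected values."""
    lines = [
        f"Run:        {run}",
        f"Task:       {task}",
        f"Boundaries: {boundaries}",
        "",
        "Validation",
    ]
    lines += [render_check(*check) for check in checks]
    lines += ["", f"Decision:   {decision}"]
    return "\n".join(lines)


text = render_report(
    run="2026-07-02 · fix-rate-limit-headers",
    task="Return correct Retry-After header on 429 responses",
    boundaries="api/middleware/* only · no changes to rate-limit logic",
    checks=[
        ("type check, lint, build", "passed"),
        ("load test", "skipped", "staging env unavailable"),
    ],
    decision="APPROVED by mk · load test to follow before release",
)
```

The resulting string can be written to a file next to the code it describes, which is the part that matters.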

What evidence changes in practice

Reviews get faster and deeper at once – the reviewer starts from intent and validation status instead of reconstructing both from the diff.
“Looked right, wasn’t” incidents become traceable – when something breaks, the report shows what was and wasn’t checked, turning blame into process improvement.
An audit trail accumulates for free – per-run reports add up to a defensible answer to “how does your team control AI-assisted changes?”, a question engineering leads hear increasingly often.
The team learns – skipped validations and recurring uncertainties are visible patterns, not anecdotes.
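
To make the last point concrete: once reports are structured, recurring gaps can be counted rather than remembered. A small sketch, assuming each stored report has been loaded into a dict with a `checks` mapping from check name to status (a hypothetical shape, as is the function name `recurring_gaps`):

```python
from collections import Counter


def recurring_gaps(reports, min_count=2):
    """Return (check name, count) pairs for checks that were skipped or
    uncertain in at least `min_count` reports, most frequent first."""
    counts = Counter(
        name
        for report in reports
        for name, status in report["checks"].items()
        if status in ("skipped", "uncertain")
    )
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


# Made-up history of three runs, for illustration only.
history = [
    {"checks": {"unit tests": "passed", "load test": "skipped"}},
    {"checks": {"unit tests": "passed", "load test": "skipped"}},
    {"checks": {"unit tests": "passed", "cdn headers": "uncertain"}},
]
gaps = recurring_gaps(history)
```

In this made-up history, the load test shows up as skipped in two of three runs, which is the kind of pattern a team would want to see before it becomes an incident.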

Where Reality Graph fits

Reality Graph generates evidence reports as a byproduct of its verification workflow: the task and validation plan are defined before the run, boundary checks and validation results are collected after it, and the report assembles itself – locally, stored with your code. It is in private beta; early access is open for a small group of teams.
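
As an illustration of what a boundary check can look like in general, here is a minimal sketch using glob patterns from Python's standard library. It is not Reality Graph's implementation or API; the function name and the file paths are hypothetical.

```python
from fnmatch import fnmatchcase


def outside_boundaries(changed_files, allowed_patterns):
    """Return the changed paths that match none of the allowed patterns.

    Note: in fnmatch, '*' also matches '/', so 'api/middleware/*'
    covers files in nested directories as well.
    """
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical changed files checked against the boundary from the sample report.
violations = outside_boundaries(
    ["api/middleware/headers.py", "api/limits/core.py"],
    ["api/middleware/*"],
)
```

An empty result corresponds to the "within boundaries ✓" line in the sample; anything else is a finding for the reviewer, not something to fix silently.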

What it does

Generates the report from the verification workflow – no manual bookkeeping
Records intent, boundaries, validation results, and open questions per run
Keeps the negative space visible: skipped and uncertain items are first-class
Stores reports locally, with your code – auditable by your team

What it does not do

Invent evidence – what didn't run isn't reported as run
Replace tests or CI – it records what they said, per run
Ship your reports anywhere – they stay in your environment
Claim compliance – evidence supports auditability; it is not certification
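
The first of these limits, never reporting an unrun check as run, can be sketched as a reconciliation between the validation plan and the results that actually came back. This is an illustrative sketch under assumed data shapes (a list of planned check names and a dict of boolean results), not a description of the product's internals.

```python
def reconcile(planned, results):
    """Map every planned check to a status.

    A check with no recorded result is marked 'skipped'; it is never
    assumed to have passed.
    """
    report = {}
    for name in planned:
        if name in results:
            report[name] = "passed" if results[name] else "failed"
        else:
            report[name] = "skipped"
    return report


# Hypothetical plan and results: the load test never produced a result.
statuses = reconcile(
    planned=["unit tests", "type check", "load test"],
    results={"unit tests": True, "type check": True},
)
```

The default matters here: absence of a result is recorded as a gap, so the report can only be as optimistic as the checks that actually ran.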


FAQ

What is an AI coding evidence report?

A short, structured record attached to each AI coding run: what the task was, what changed, what validation ran and with what result, what was deliberately skipped, and what remains uncertain. It turns 'the agent said it's done' into something a reviewer can actually check.

Isn't that just a pull request description?

A PR description is written after the fact, usually by the same model that made the change, and says whatever its author chose to say. An evidence report is anchored to a task that was written down before the run and records validation results – including the negative space: what was not tested.

Who reads evidence reports?

Reviewers first – they get intent and validation status instead of a bare diff. Then future maintainers doing archaeology on a change, engineering leads who need to know how AI-assisted work is verified, and – in regulated contexts – auditors who ask how changes were controlled.

Does writing evidence slow the team down?

The report is generated from what the verification workflow already knows – the written task, the boundary checks, the validation results. The human cost is minutes of defining the task up front, which pays back at review time. Evidence that requires manual bookkeeping would indeed die; that is why it has to fall out of the workflow.

What is the difference between evidence and an audit trail?

Evidence is per run and answers 'was this change verified, and how?'. An audit trail is the accumulation over time and answers 'how does this team control AI-assisted changes in general?'. Reports stored with the code give you both without extra process.
