Concept

Proof-Carrying Coding

Last updated: 2026-07-02 · 4 min read

Proof-carrying coding is the principle that every change – especially an AI-generated one – ships together with the evidence needed to accept it: task, checks, results, skips. It borrows the architecture of Necula’s proof-carrying code (1997): untrusted producers attach what receivers can verify cheaply, because checking evidence costs minutes while reconstructing work costs hours. The proofs here are evidence, not formal proofs – weaker guarantee, far wider scope, same economics.

The 1997 idea: trust through checkable artifacts

In Proof-Carrying Code (POPL 1997, later recognized as the most influential POPL paper of its year), George Necula solved an adversarial trust problem elegantly: a host must run a binary from an untrusted source. Instead of trusting the source or laboriously analyzing the binary, the host requires the source to attach a formal safety proof – and then runs a small, fast proof checker. The deep insight is asymmetric effort: constructing the proof is expensive and falls on the producer; checking it is cheap and falls on the consumer. The host stays sovereign without doing the producer’s work.

The transfer to AI coding - honestly marked as analogy

AI coding recreates the trust setup almost exactly: a prolific, not-fully-trusted producer (the model) delivers work to a receiver (your team) that cannot afford to redo it. With 96% distrusting AI code and only 48% consistently verifying, most teams currently resolve the dilemma by trusting anyway. Proof-carrying coding resolves it the 1997 way – with one honest difference that must be stated, not hidden:

	Proof-carrying code (1997)	Proof-carrying coding (2026)
Untrusted producer	Code supplier (possibly adversarial)	The generating model/agent
Attached artifact	Formal, machine-checkable safety proof	Evidence record: task, checks, results, skips
Receiver's job	Run a small proof checker	Verify the evidence, judge the trade-offs
Property covered	Narrow safety properties, with certainty	Task conformance and validation, with confidence
The economics	Checking ≪ proving	Checking evidence ≪ reconstructing the run

Necula's proof-carrying code and proof-carrying coding, mapped - the architecture transfers intact; the guarantee is deliberately traded from certainty on narrow properties to confidence on the broad question.

The bottom row is why the analogy is load-bearing rather than decorative: in both systems, the entire design exists because verification is radically cheaper than reproduction. Evidence that costs more to check than redoing the work would be worthless – which is also the quality bar for what belongs in the record.
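The quality bar in the bottom row can be written down as plain arithmetic. The following is a minimal sketch, not a measured model: the function names are hypothetical, and the numbers in the usage lines (5 minutes to check, 60 minutes to reconstruct, 2 review rounds) are illustrative placeholders to be swapped for your own.

```python
def evidence_is_worth_checking(check_minutes: float, reconstruct_minutes: float) -> bool:
    """Quality bar: evidence must be cheaper to check than the work is to redo."""
    return check_minutes < reconstruct_minutes


def minutes_saved(check_minutes: float, reconstruct_minutes: float, review_rounds: int) -> float:
    """Reviewer minutes saved on one change when each review round
    checks the record instead of reconstructing the run."""
    return (reconstruct_minutes - check_minutes) * review_rounds


# Illustrative placeholder numbers, not measurements.
print(evidence_is_worth_checking(5, 60))   # True
print(minutes_saved(5, 60, 2))             # 110
print(evidence_is_worth_checking(90, 60))  # False: this record fails the quality bar
```

A record that fails the first check does not belong in the evidence at all, however complete it looks.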

What the principle demands in practice

Producers attach, always. Every AI run delivers its change plus its record – the written task, the validations that ran, their results, what was skipped. No record, no review request.
Receivers verify, not reconstruct. The review starts from the evidence: scope against boundaries, criteria against results. Judgment time goes to architecture and trade-offs, not archaeology.
Nothing merges on trust alone. “The agent said it passed” is a claim, not evidence – validation authored outside the generating session is what makes the record checkable.
The record persists. Stored with the code, the records accumulate into the audit trail nobody had to write retroactively. The artifact’s structure, with a labeled sample, is on the evidence reports page.
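The four demands above can be sketched as a small data structure plus a gate that runs before a review request is opened. This is an illustration under stated assumptions, not the structure of any particular tool: the names EvidenceRecord, Check, and review_blockers are hypothetical, as is the choice to treat skipped validations as information for the reviewer rather than as blockers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Check:
    name: str
    passed: bool
    # True if the validation was authored outside the generating session.
    authored_outside_session: bool


@dataclass
class EvidenceRecord:
    task: str                                         # the written task the change implements
    checks: List[Check] = field(default_factory=list)  # validations that ran, with results
    skipped: List[str] = field(default_factory=list)   # validations that were not run


def review_blockers(record: Optional[EvidenceRecord]) -> List[str]:
    """Return the reasons a change is not ready for review (empty list = ready)."""
    if record is None:
        return ["no evidence record attached"]  # no record, no review request
    blockers = []
    if not record.task.strip():
        blockers.append("no written task")
    if not any(c.authored_outside_session for c in record.checks):
        blockers.append("no validation authored outside the generating session")
    blockers += [f"check failed: {c.name}" for c in record.checks if not c.passed]
    return blockers


record = EvidenceRecord(
    task="Add rate limiting to the login endpoint",
    checks=[Check("unit tests", passed=True, authored_outside_session=True)],
    skipped=["load test"],
)
print(review_blockers(record))  # []
print(review_blockers(None))    # ['no evidence record attached']
```

The gate only decides whether a review may start; the skipped list is passed through untouched so the human reviewer can judge whether the omission is acceptable.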

The 2026 revival - and the honest limits

The formal end of the idea is being rebuilt for agents: 2026 research on proof-carrying agent actions attaches machine-checkable justifications to agent actions, which a gate verifies before execution – Necula’s architecture, re-aimed at runtime governance. The pragmatic end, described above, is deployable today. Its limits deserve the same clarity. Evidence raises confidence; it does not prove correctness. A record can be complete and the architecture still wrong, which is why the human gate stays. And the quality of the whole scheme is bounded by the quality of the task – vague mandates produce unfalsifiable evidence.

Where Reality Graph fits

Proof-carrying coding is Reality Graph’s operating principle stated as a concept: every run is verified against its written task, and the change travels with its evidence report – produced as a byproduct, checked in minutes, stored with the code, local-first. The principle stands without any specific tool; the tool exists because the principle is tedious to uphold by hand at AI volume.

This principle gives you

Reviews that start from verified facts, not archaeology
The verification asymmetry working for you, per change
An audit trail accumulating as a byproduct
A nearly 30-year-old, award-winning architecture as foundation

It does not give you

Formal proofs - the evidence raises confidence, not certainty
A substitute for the human merge decision
Value from vague tasks - checkable mandates are the precondition
Necula's guarantees - the analogy is honest about being one


FAQ

What is proof-carrying coding, and how does it work in practice?

Proof-carrying coding is the principle that every code change - especially an AI-generated one - arrives together with the evidence needed to accept it: the written task it implements, the checks that ran, their results, and what was skipped. The receiver verifies the evidence instead of reconstructing the work, which is dramatically cheaper. In practice it means an evidence record per run, stored with the change, and a review that starts from verified facts.

Where does the term come from?

From George Necula's proof-carrying code (POPL 1997): a mechanism where a host can safely execute a program from an untrusted source because the program carries a formal, machine-checkable safety proof - and checking a proof is far cheaper than constructing one. Proof-carrying coding borrows the architecture, not the formalism: the producer of a change attaches what the receiver needs to trust it cheaply.

Are these actual proofs, like in Necula's work?

No, and the difference deserves plain statement. Necula's proofs were formal and machine-verified, covering narrow safety properties with mathematical certainty. The evidence in proof-carrying coding - task conformance, test results, boundary checks - is weaker: it raises confidence rather than establishing certainty, but it covers the much broader question of whether a change does what was asked. Weaker guarantee, wider scope; the economics carry over intact.

Why does the asymmetry matter so much?

Because it decides whether evidence is worth producing. Checking an attached evidence record takes a reviewer minutes; reconstructing what an AI run did - reverse-engineering intent from a diff, re-running checks, guessing at skips - takes an hour, repeated per review round. When verification is much cheaper than reproduction, attaching evidence pays for itself immediately. That asymmetry was Necula's insight in 1997 and it is the entire business case today.

How does this differ from an evidence report?

The evidence report is the artifact - what gets recorded per run, in which structure, with a labeled sample. Proof-carrying coding is the principle the artifact serves: producers attach, receivers verify, nothing is accepted on trust alone. You can read them in either order; the report page shows the how, this page argues the why.

Is anyone applying this to AI agents formally?

Yes - the idea is having a visible revival. 2026 research on proof-carrying agent actions applies Necula's architecture to runtime governance of agent systems: actions carry machine-checkable justifications that a gate verifies before execution. The formal end of the spectrum is being rebuilt for agents while the pragmatic end - evidence-attached changes - is deployable in any team today.
