Method

Machine-Checkable Specifications

Last updated: 2026-07-02 · 5 min read

A machine-checkable specification turns a prompt into a task definition that the result can be verified against: one goal, explicit boundaries, acceptance criteria phrased as yes/no questions, and a validation plan. The prompt instructs the model; the specification outlives it – as the reference for verification, review, and everyone who touches the change later.

Why a prompt is not a specification

Acceptance criteria were invented for humans. A product manager wrote them, a developer interpreted them, a tester checked the result – and the slack between vague wording and correct behavior was absorbed by people who could ask follow-up questions. An AI agent does not absorb slack. It builds it – literally, confidently, and without the friction of asking what you meant.

The second problem is persistence. The prompt that carried your intent is gone by review time, which is why reviewers of AI changes end up reconstructing requirements from the diff itself. Research on delegation contracts for coding agents frames reviewability as exactly four questions: what was asked, what the agent was allowed to do, what came back, and what evidence supports it. A specification is the artifact that answers the first two before the run starts – and feeds the spec-vs-implementation check afterwards.

The four building blocks of a checkable task

  1. Goal – one sentence. What must be true after the run that was not true before. If the goal needs three sentences, it is probably two tasks.
  2. Boundaries – the authority envelope. Which files may change, which must not, and which behavior is off-limits. Boundaries are what turn scope creep from a debate into a finding.
  3. Acceptance criteria – 3 to 7 yes/no questions. Each one decidable without discussion. This is the part the verification step will walk through line by line.
  4. Validation plan – which checks count. Tests (ideally written before the run), types, lint, build, and any manual check that cannot be automated – named up front, so “it passed” has a defined meaning.

Compiling a conversational prompt into this form takes minutes and looks like this:

task-compilation.md

Example – not real run data
PROMPT (what you'd naturally type)
"Rate limiting responses are confusing clients, can you make the
429s more helpful?"

COMPILED SPECIFICATION (what the run is verified against)
Goal:      429 responses carry a correct Retry-After header
Boundary:  api/middleware/* only · rate-limit thresholds unchanged
Criteria:  [1] Retry-After present on every 429 response
           [2] value equals remaining window in seconds (±1s)
           [3] empty/malformed client IDs still get a 429, not a 500
           [4] 2xx and other 4xx paths byte-identical to before
Validate:  unit tests (pre-written) · types · lint · build
           manual: header visible behind the CDN (staging)
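Criteria [1] and [2] from the example can be written as yes/no functions. The following is a minimal sketch, not part of any real run: the function names are hypothetical, a response is assumed to be available as a status code plus a header mapping, and only the delta-seconds form of Retry-After is accepted because the criterion is stated in seconds.

```python
from typing import Mapping, Optional


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[int]:
    """Return Retry-After as whole seconds, or None if absent or not an integer.

    Header names are compared case-insensitively. The HTTP-date form of
    Retry-After is deliberately treated as "not decidable" here (assumption).
    """
    for name, value in headers.items():
        if name.lower() == "retry-after":
            value = value.strip()
            if value.isascii() and value.isdigit():
                return int(value)
            return None
    return None


def criterion_1(status: int, headers: Mapping[str, str]) -> bool:
    """[1] Retry-After present on every 429 response."""
    return status != 429 or retry_after_seconds(headers) is not None


def criterion_2(status: int, headers: Mapping[str, str],
                remaining_window_s: float) -> bool:
    """[2] Value equals the remaining window in seconds, within one second."""
    if status != 429:
        return True  # the criterion only constrains 429 responses
    value = retry_after_seconds(headers)
    return value is not None and abs(value - remaining_window_s) <= 1
```

Each function returns a plain boolean, so a verification step can record the answer per criterion without interpretation. Criteria [3] and [4] need a test harness around the real middleware and are left out of this sketch.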

Rules that make criteria checkable

  • Replace adjectives with numbers. “Fast” becomes “p95 under 200 ms”; “helpful error” becomes “message names the field and the limit”. An adjective is an invitation for the model to decide what you meant.
  • Name the unhappy paths. Empty inputs, duplicates, missing permissions, timeouts. Left unstated, the agent invents its own policy for them – and its self-written tests will confirm that invented policy.
  • Bind criteria to behavior, not implementation. “Uses a token bucket” ages badly and forbids better solutions; “allows 100 requests per minute per client” is the actual requirement.
  • Pick the lightest format that stays decidable. Given-When-Then maps directly onto tests and suits behavior-heavy work; a plain checklist is faster and equally checkable. Format is taste – decidability is the requirement.
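Two of the rules above lend themselves to a short sketch: replacing “fast” with a number, and binding a rate-limit criterion to observable behavior. The helper names, the nearest-rank percentile definition, and the counter-based stand-in limiter are assumptions made for illustration; a real check would call the system under test instead.

```python
from typing import Callable, Dict, Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the given latency samples."""
    if not samples_ms:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) without floats
    return ordered[rank - 1]


def is_fast(samples_ms: Sequence[float], limit_ms: float = 200.0) -> bool:
    """'Fast' stated as a number: p95 under limit_ms."""
    return p95(samples_ms) < limit_ms


def allows_exactly(allow: Callable[[str], bool], client_id: str,
                   limit: int = 100) -> bool:
    """Behavior-bound criterion: the first `limit` calls in one window are
    accepted and the next one is rejected. Nothing here depends on how the
    limiter is implemented."""
    accepted = all(allow(client_id) for _ in range(limit))
    return accepted and not allow(client_id)


def make_counter_limiter(limit: int) -> Callable[[str], bool]:
    """Hypothetical stand-in limiter for one window, used only to exercise
    the check above."""
    counts: Dict[str, int] = {}

    def allow(client_id: str) -> bool:
        counts[client_id] = counts.get(client_id, 0) + 1
        return counts[client_id] <= limit

    return allow
```

Because `allows_exactly` only sees an `allow` callable, swapping a token bucket for a sliding window would not change the criterion, which is the point of binding it to behavior.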

Limits and typical mistakes

  • Over-specifying small tasks. The form scales down to three lines. A specification longer than the diff it governs is a smell – split the task or trim the spec.
  • Specifying without checking. A specification nobody compares against is documentation theater. The follow-up comparison is the point of writing it.
  • Expecting completeness. The specification verifies what you thought to state. Formalizing intent completely remains an open research problem – the method reduces the gap, it does not close it.
  • Letting the model write its own criteria unreviewed. Drafting with AI is fine; accepting the draft unread re-creates the circularity the specification exists to break.

Where Reality Graph fits

Reality Graph makes this compilation step the front door of every run: tasks are defined with goal, boundaries, criteria, and a validation plan, the run is checked against them afterwards as part of the verification loop, and the outcome lands in an evidence report. The method works with a plain text file too – the tool takes over the bookkeeping that otherwise depends on discipline.

A checkable specification gives you

  • A persistent record of what was asked and allowed
  • Criteria a verification step can walk mechanically
  • Unhappy paths decided by you, not invented by the model
  • A reference that outlives the prompt and the session

It does not

  • Guarantee the intent itself was complete or wise
  • Require Given-When-Then or any fixed format
  • Pay off without the follow-up comparison
  • Replace tests – it defines which tests count

FAQ

How do I phrase tasks for AI coding tools so the result is checkable?
Compile the prompt into four parts before the run: one goal sentence, explicit boundaries (files and behavior the agent must not touch), three to seven acceptance criteria each phrased as a yes/no question, and a validation plan naming the checks that count. The prompt can stay conversational – the specification is the artifact the result gets verified against.

What makes an acceptance criterion “machine-checkable”?
It can be answered yes or no without discussion. That usually means replacing adjectives with numbers (“fast” becomes “p95 under 200 ms”), naming the unhappy paths explicitly (empty input, duplicates, missing permissions), and binding each criterion to an observable behavior instead of an implementation detail.

Do I have to use Given-When-Then?
No. Given-When-Then is the most structured format and maps directly onto tests, which makes it a good default for behavior-heavy tasks. But a plain checklist of yes/no criteria is equally checkable and faster to write. The format matters less than the property: every line must be decidable.

Isn't this a lot of overhead for small tasks?
The form scales down. A one-line fix needs a one-line goal, one boundary, and one criterion – thirty seconds, not a document. The rule of thumb: the specification should be shorter than the diff it governs. If it is not, you are over-specifying or the task is too big for one run.

Why not just write a better prompt?
Because prompts disappear and specifications persist. A prompt is an instruction to the model; a specification is a reference for everyone after the model – the verification step, the reviewer, the auditor, and you in three months. Research on delegation contracts for coding agents shows reviewability depends on exactly this persistent record of what was asked and allowed.

What happens to the specification after the run?
It becomes the input for the spec-vs-implementation check: the diff is compared criterion by criterion, off-scope changes surface as boundary violations, and the outcome is recorded with the change. Without that follow-up comparison, the specification is documentation theater – write it because you will check against it.
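The follow-up comparison can be sketched in a few lines. This is an illustration under stated assumptions, not how Reality Graph or any particular tool implements it: `TaskSpec`, `boundary_violations`, and `compare` are hypothetical names, boundaries are assumed to be glob patterns over changed file paths, and the yes/no answers per criterion are assumed to come from whatever checks the validation plan names.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Dict, List


@dataclass
class TaskSpec:
    goal: str
    allowed_paths: List[str]   # glob patterns the run may touch
    criteria: List[str]        # yes/no questions, one per entry
    validation: List[str] = field(default_factory=list)


def boundary_violations(spec: TaskSpec, changed_files: List[str]) -> List[str]:
    """Changed files that match none of the allowed patterns.

    Note: in fnmatch patterns, '*' also matches '/', so 'api/middleware/*'
    covers nested paths as well.
    """
    return [
        path for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in spec.allowed_paths)
    ]


def compare(spec: TaskSpec, changed_files: List[str],
            answers: Dict[str, bool]) -> Dict[str, object]:
    """Walk the spec criterion by criterion; a missing answer counts as unmet."""
    unmet = [c for c in spec.criteria if answers.get(c) is not True]
    violations = boundary_violations(spec, changed_files)
    return {
        "unmet_criteria": unmet,
        "boundary_violations": violations,
        "passed": not unmet and not violations,
    }
```

Treating an unanswered criterion as unmet keeps the comparison honest: a run passes only when every stated question has an explicit yes and no file outside the boundary was touched.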
