Reducing LLM Token Costs
Last updated: 2026-07-02 · 4 min read
LLM token costs in coding workflows are mostly context costs – and context is resent every turn. The five levers that cut the bill without degrading output: focused task context, prompt caching, model routing, session hygiene, and async batching. The constraint that disciplines all five: a wrong answer from missing context costs more in rework than the tokens it saved – so compression means removing the irrelevant, never truncating the relevant.
Where the tokens actually go
The mental model that makes coding-tool bills legible: input dwarfs output, and input repeats. Every agent turn resends conversation history, file contents and repository context – a long session pays for its own past on every step, and a tool that indexes broadly pays for breadth on every request. Output tokens, the code itself, are usually the smaller line. Two consequences follow immediately: “write shorter prompts” is almost irrelevant, and the levers that work are all about what context enters and how often it re-enters. Current per-token rates live on the vendors’ pricing pages (Anthropic, OpenAI) – they change often enough that this article cites the mechanics, not the numbers.
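To make "input repeats" concrete, here is a small arithmetic sketch. The session sizes are invented for illustration, not measured from any tool, and the function names are hypothetical. It assumes every turn resends a fixed base context plus all history so far, and that history grows by a constant amount per turn.

```python
def session_input_tokens(base_context: int, per_turn_growth: int, turns: int) -> int:
    """Total input tokens billed across one session.

    Assumes each turn resends the base context plus all accumulated history,
    and that history grows by a fixed amount after every turn.
    """
    total = 0
    history = 0
    for _ in range(turns):
        total += base_context + history  # this turn's input: context plus the session's past
        history += per_turn_growth       # the turn's prompt and reply join the history
    return total


def session_output_tokens(output_per_turn: int, turns: int) -> int:
    """Output is billed once per turn and never resent as output."""
    return output_per_turn * turns


# Illustrative numbers only: 20k tokens of standing context,
# 1.5k tokens of new history per turn, 500 output tokens per turn.
one_long = session_input_tokens(20_000, 1_500, turns=40)
two_short = 2 * session_input_tokens(20_000, 1_500, turns=20)
output = session_output_tokens(500, turns=40)

print(f"one 40-turn session: {one_long:,} input tokens")   # 1,970,000
print(f"two 20-turn sessions: {two_short:,} input tokens")  # 1,370,000
print(f"output, either way:   {output:,} tokens")           # 20,000
```

Under these assumptions the history term grows with the square of the turn count, which is why splitting the session changes the total while shortening the prompts barely would. The sketch ignores the cost of carrying state into the second session, which a real comparison would have to include.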
The five levers, with their quality risks named
| Lever | Mechanism | Quality risk |
|---|---|---|
| Focused task context | Send only what the task needs - boundaries define relevance | Cutting load-bearing context produces confident wrong code |
| Prompt caching | Stable prefixes bill at cached rates; organize context as stable + variable | Low - but stale cached context can outlive its truth |
| Model routing | Small models for mechanical steps, frontier for judgment | Under-modeling hard tasks; route by task class, not by hope |
| Session hygiene | Short sessions; handoff artifacts instead of endless context | Losing state between sessions - the handoff must carry it |
| Async batching | Vendor batch APIs discount non-interactive work | Latency - only for work nobody is waiting on |
The first lever carries the most weight and the most nuance. Its cheapest implementation is a written task with boundaries: by stating what the run may touch, it states what context is relevant – compression guided by intent instead of guesswork. The fourth lever’s craft is the session handoff: persistent state in an artifact beats persistent state in an ever-growing, ever-resent context window.
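One way to read "boundaries define relevance" is as a filter over the repository. The sketch below is a minimal illustration, not any tool's actual mechanism: the `Task` class, its `may_touch` and `must_read` fields, and the file paths are all hypothetical. The `must_read` field exists to keep load-bearing context (interfaces, shared types) that sits outside the editable boundary.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Task:
    """A written task with declared boundaries (field names are hypothetical)."""
    goal: str
    may_touch: tuple[str, ...]  # glob patterns the run is allowed to modify
    must_read: tuple[str, ...]  # load-bearing files outside the boundary


def select_context(task: Task, repo_files: list[str]) -> list[str]:
    """Keep files inside the boundary plus declared dependencies; drop the rest."""
    patterns = task.may_touch + task.must_read
    return [path for path in repo_files if any(fnmatch(path, p) for p in patterns)]


task = Task(
    goal="Add proration to invoice totals",
    may_touch=("billing/*.py",),
    must_read=("shared/money.py",),
)
repo = [
    "billing/invoice.py",
    "billing/tax.py",
    "shared/money.py",
    "search/index.py",
    "docs/README.md",
]
print(select_context(task, repo))
# → ['billing/invoice.py', 'billing/tax.py', 'shared/money.py']
```

The filter removes the irrelevant subsystem (`search/`) and keeps the dependency the change relies on. Leaving `shared/money.py` out of `must_read` would be the truncation failure the table warns about.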
The quality constraint - measured, not assumed
Every lever above can be over-pulled, and the failure mode is silent: a model missing context does not warn, it guesses – and plausible-but-wrong code costs review and rework that dwarf token savings. METR’s randomized trial of experienced open-source developers, in which participants believed AI tools had sped them up while their measured task times were slower, is the standing reminder that perceived efficiency and real efficiency diverge. The discipline is to track two curves together: cost per merged change (not per request – requests are cheap, merged correctness is the product) alongside your verification metrics. An optimization that lowers the first curve while churn rises has not saved anything; it has moved cost from the API bill to the debt bill, where it compounds.
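The two-curve check can be sketched in a few lines. Everything here is an assumption made for illustration: the `Period` fields, the definition of churn as the share of merged changes later reverted or substantially rewritten, and the verdict strings. Substitute whatever verification metrics your team already tracks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    """One measurement window (field names and churn definition are assumptions)."""
    api_spend: float       # total API bill for the window, any currency
    merged_changes: int    # changes that actually merged
    reworked_changes: int  # merged changes later reverted or substantially rewritten


def cost_per_merged_change(p: Period) -> float:
    if p.merged_changes == 0:
        raise ValueError("no merged changes in this period")
    return p.api_spend / p.merged_changes


def churn_rate(p: Period) -> float:
    if p.merged_changes == 0:
        raise ValueError("no merged changes in this period")
    return p.reworked_changes / p.merged_changes


def optimization_verdict(before: Period, after: Period) -> str:
    """Read both curves together; a cheaper bill alone proves nothing."""
    cheaper = cost_per_merged_change(after) < cost_per_merged_change(before)
    churn_up = churn_rate(after) > churn_rate(before)
    if cheaper and churn_up:
        return "cost moved to rework"
    if cheaper:
        return "saving holds"
    return "no saving"


before = Period(api_spend=1200.0, merged_changes=60, reworked_changes=6)
after = Period(api_spend=900.0, merged_changes=60, reworked_changes=12)
print(optimization_verdict(before, after))  # → cost moved to rework
```

In this invented example the cost per merged change drops from 20 to 15 while churn doubles from 10% to 20%, which is the pattern the paragraph above describes as moving cost from the API bill to the debt bill.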
Where Reality Graph fits
Reality Graph’s contribution to this topic is structural, and we deliberately quote no percentages: its written tasks with declared boundaries define, per run, what context is relevant – which is the first lever implemented as workflow rather than willpower. Whether and how much that shrinks your bill depends on your tools and volumes; measure it with the two-curve discipline above rather than trusting any vendor’s number, ours included.
This guide gives you
- The context-dominates mental model that makes bills legible
- Five levers ordered by leverage, each with its failure mode
- The two-curve discipline: cost per merge next to quality
- Mechanics that survive vendor price changes
It does not give you
- Current per-token prices - the vendor pages own those
- Savings percentages, for any tool including Reality Graph
- A pass on quality measurement - silent regressions are the trap
- Subscription-tier arbitrage tricks - terms change quarterly
FAQ
- How do you cut the API costs of AI coding tools?
- By attacking where the tokens actually are: context, not output. Five levers, in order of leverage: send only task-relevant context (scoped, focused tasks); exploit prompt caching so repeated prefixes bill at the cached rate; route simple operations to smaller models; keep sessions short so growing context is not resent every turn; and batch asynchronous work where vendors discount it. The constraint on all five: a wrong answer from missing context costs more than the tokens it saved.
- Where do the tokens actually go in AI coding?
- Overwhelmingly into input context, and repeatedly: coding tools resend conversation history, file contents and repository context with every turn, so a long session pays for its own past again and again. Output tokens - the generated code - are usually the smaller share. That is why the effective levers are context levers, and why 'write shorter prompts' barely moves the bill.
- What is context compression, precisely?
- Removing what the model does not need for the task at hand - not truncating what it does. The difference is the whole craft: dropping an irrelevant subsystem from context is free; dropping the interface definition the change depends on produces confident, wrong code whose rework costs more than the saved tokens. A written task with boundaries is the cheapest compression oracle available: it states what is relevant before the run.
- How much does prompt caching save?
- Vendors bill cached input tokens at a substantial discount - current rates are on the providers' pricing pages, which change often enough that transcribing them here would age badly. The practical point is structural: caching pays when your requests share long, stable prefixes (system prompts, standing project context), which rewards organizing context as stable-prefix-plus-variable-suffix instead of shuffling everything per request.
- When does cost optimization hurt quality?
- When it removes load-bearing context or downgrades the model below the task's difficulty. The failure is silent: the model does not report missing context, it guesses - and plausible-but-wrong output costs review time and rework that dwarf token savings. The guardrail is measuring both curves: track cost per merged change alongside your verification metrics, and treat rising churn after an optimization as the louder signal.
- Do these levers apply to subscription tools like Copilot or Cursor?
- Indirectly. Flat-rate subscriptions hide token mechanics until you hit usage-based tiers, overage pricing or bring-your-own-key setups - all three became more common in 2026. The habits transfer regardless: focused tasks and short sessions improve output quality even where they do not touch your bill, and they position you for the metered pricing the market keeps drifting toward.
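The stable-prefix-plus-variable-suffix layout from the caching answer above can be illustrated without touching any vendor API. This sketch is vendor-neutral and rests on assumptions: the function names are hypothetical, and it compares shared characters as a rough proxy, whereas real caches match on tokens and apply provider-specific rules that the providers' documentation defines.

```python
import os


def build_prompt(system_prompt: str, project_context: str, task_text: str) -> str:
    """Cache-friendly order: stable parts first, the per-request part last."""
    stable_prefix = system_prompt + "\n\n" + project_context
    return stable_prefix + "\n\n" + task_text


def build_prompt_shuffled(system_prompt: str, project_context: str, task_text: str) -> str:
    """Same content, variable part first: consecutive requests diverge immediately."""
    return task_text + "\n\n" + system_prompt + "\n\n" + project_context


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common leading run of characters (a proxy for cacheable prefix)."""
    return len(os.path.commonprefix([a, b]))


system = "You are a careful code reviewer."
context = "Project conventions: " + "snake_case, typed signatures, no global state. " * 20
first, second = "Task: rename the parser module", "Task: fix the retry loop"

stable = shared_prefix_chars(
    build_prompt(system, context, first), build_prompt(system, context, second)
)
shuffled = shared_prefix_chars(
    build_prompt_shuffled(system, context, first),
    build_prompt_shuffled(system, context, second),
)
print(f"stable-first layout shares {stable} leading characters")
print(f"task-first layout shares {shuffled} leading characters")
```

With the stable material first, two consecutive requests share the whole system prompt and project context as a common prefix. With the task first, they share only the few characters the two task strings happen to have in common.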