Concept
Why AI-Generated Code Fails
Last updated: 2026-07-02 · 3 min read
AI-generated code fails in characteristic classes, not as random bugs: hallucinated APIs and packages, silent edge-case errors, scope creep, self-confirming tests, and plausible-but-wrong logic. The common root: a model completes plausible patterns without ground truth - and its polish disarms exactly the review that should catch it.
The five failure classes
| Class | What it looks like | Why review misses it | What catches it |
|---|---|---|---|
| Hallucinated APIs & packages | Calls to functions or packages that don't exist (package hallucination rates of ~21.7% in open-source models, ~5.2% in commercial models) | The names look exactly like real ones | Build/type checks; dependency allow-lists |
| Silent edge-case errors | Happy path works; empty inputs, duplicates, timeouts misbehave | Reviewers walk the happy path too | Unhappy paths as explicit acceptance criteria |
| Scope creep | Changes beyond the task - 'improvements' nobody asked for | Off-scope edits look like diligence | Declared boundaries, checked after the run |
| Self-confirming tests | Model-written tests assert the model's own assumptions | Green suite reads as verification | Tests written before the run; validation the model didn't author |
| Plausible-but-wrong logic | Reads idiomatically, solves a subtly different problem | Polish builds unearned confidence (61% of developers report AI code that looks correct but is not reliable) | Spec-vs-implementation check against written intent |
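A minimal sketch of the dependency allow-list gate named in the first table row, assuming a requirements-style text file. The allow-list contents, the function names, and the package `fast-json-utils` are made up for illustration; a real gate would also have to handle lockfiles, URLs, and transitive dependencies.

```python
import re
from typing import List, Optional, Set

# Hypothetical allow-list of vetted packages, stored as normalized names.
ALLOWED_PACKAGES: Set[str] = {"requests", "numpy", "pydantic"}


def parse_requirement_name(line: str) -> Optional[str]:
    """Return the normalized package name from one requirements line, or None."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if not line or line.startswith("-"):  # skip blanks and options such as "-r base.txt"
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    if match is None:
        return None
    # Name normalization: lowercase, runs of "-", "_", "." collapse to one hyphen.
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def unvetted_packages(requirements_text: str, allowed: Set[str]) -> List[str]:
    """List every declared package that is not on the allow-list."""
    names = (parse_requirement_name(line) for line in requirements_text.splitlines())
    return sorted({name for name in names if name is not None and name not in allowed})


# Example input: two vetted packages plus one name the team never approved.
EXAMPLE_REQUIREMENTS = """\
requests==2.31.0
NumPy>=1.26
fast-json-utils==0.1  # made-up name standing in for a model suggestion
"""
```

Calling `unvetted_packages(EXAMPLE_REQUIREMENTS, ALLOWED_PACKAGES)` flags only the unapproved name. The check does not decide whether a package exists; it only forces a human decision before anything new gets installed.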
The common root: plausibility without ground truth
None of these classes is a random defect - they are all the same mechanism viewed from different angles. A language model completes the most plausible continuation of a pattern. Usually the plausible continuation is also correct; the failure classes are the places where plausibility and truth systematically diverge: an API that should exist by analogy, a policy for empty input the requirements never stated, a test that enshrines the implementation’s own assumption.
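To make the "policy for empty input" case concrete, here is a small illustrative sketch. Both functions and their names are hypothetical; which behavior is right depends on a requirement that, in this scenario, nobody wrote down.

```python
from typing import Sequence


def average_latency_ms(samples: Sequence[float]) -> float:
    """Plausible completion: silently returns 0.0 when there are no samples.

    Nothing in the (hypothetical) task stated this policy. A dashboard fed by
    this function would show "0 ms" for a service that reported no data at all.
    """
    return sum(samples) / len(samples) if samples else 0.0


def average_latency_ms_explicit(samples: Sequence[float]) -> float:
    """Same calculation, with the empty-input policy stated instead of invented.

    Raising is only one possible policy, assumed here for illustration.
    """
    if not samples:
        raise ValueError("no samples: average is undefined")
    return sum(samples) / len(samples)
```

Both versions agree on every non-empty input, so a review that walks the happy path cannot tell them apart.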
The polish is part of the problem. 61% of developers report AI code that looks correct but is not reliable - and security data puts numbers on the gap: in Veracode’s data, 45% of AI-authored PRs introduce at least one OWASP Top 10 issue. With a human author, sloppy style correlated with sloppy logic, and reviewers calibrated on that signal. AI output removed the signal without removing the errors.
The special case: self-confirming tests
The most treacherous class deserves its own paragraph, because it defeats the standard remedy. “Let the AI write tests” sounds like verification - but tests written by the generating model verify the model’s assumptions, not your requirements. If the implementation misread the intent, the tests enshrine the misreading, and the suite goes green on a wrong program. This is the circularity problem that makes an external reference - a written specification - structurally necessary rather than nice to have.
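The circularity can be shown in a few lines. In this hypothetical scenario the written requirement says that email addresses differing only in letter case are duplicates; the implementation misreads it as case-sensitive, and the test generated alongside it inherits the same misreading. All names here are illustrative.

```python
from typing import List


def dedupe_emails(emails: List[str]) -> List[str]:
    """Hypothetical generated implementation: compares addresses case-sensitively."""
    seen = set()
    result = []
    for email in emails:
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result


SAMPLE = ["a@x.io", "a@x.io", "A@x.io"]


def model_authored_test() -> bool:
    # Written from the implementation: encodes the same case-sensitive assumption.
    return dedupe_emails(SAMPLE) == ["a@x.io", "A@x.io"]


def spec_derived_test() -> bool:
    # Written from the requirement: case-only variants count as duplicates.
    return dedupe_emails(SAMPLE) == ["a@x.io"]
```

Here `model_authored_test()` returns `True` and `spec_derived_test()` returns `False`: the same program is green against its own assumptions and red against the written intent.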
What this means for your pipeline
- Per class, one gate. Build and type checks for hallucinations, explicit unhappy-path criteria for silent errors, declared boundaries for scope creep, pre-written tests for self-confirmation, and a spec-vs-implementation check for plausible-but-wrong logic.
- Volume multiplies the failures. Each failure rate rides on top of the review bottleneck: more code to read, subtler failures to find, and the same reading capacity.
- Rates change, classes stay. Build the pipeline for the classes; enjoy every model generation that lowers the rates.
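As one example of such a gate, a declared-boundary check for scope creep can be sketched as below. The patterns and file paths are hypothetical, and the list of changed files is assumed to come from something like `git diff --name-only`.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Hypothetical boundary declared in the task before the run.
ALLOWED_PATTERNS = ["src/billing/*", "tests/billing/*"]


def out_of_scope(changed_files: Iterable[str], allowed_patterns: Iterable[str]) -> List[str]:
    """Return the changed paths that match none of the declared patterns."""
    patterns = list(allowed_patterns)
    return [
        path
        for path in changed_files
        # fnmatchcase's "*" also matches "/", so "src/billing/*" covers subfolders.
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


# Example: files touched by a run that was asked to change billing code only.
CHANGED_FILES = [
    "src/billing/invoice.py",
    "tests/billing/test_invoice.py",
    "src/auth/session.py",
    "README.md",
]
```

With these inputs, `out_of_scope(CHANGED_FILES, ALLOWED_PATTERNS)` returns the two paths outside the billing directories. An off-scope edit is not necessarily wrong, but the check turns it from something a reviewer might skim past into an explicit decision.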
Where Reality Graph fits
Reality Graph is a per-run gate against exactly these classes: boundaries catch scope creep, acceptance criteria force the unhappy paths into the open, validation the model did not author breaks self-confirmation, and the evidence report records which checks actually ran.
Knowing the classes gives you
- Targeted gates instead of generic 'review harder'
- A reviewer briefing: where AI code characteristically lies
- An explanation for green-suite production incidents
- Sourced rates for the risk conversation
It does not mean
- AI code is worse than human code in every dimension
- Any class is guaranteed per run - these are rates
- Better models make the pipeline unnecessary
- Human code lacks its own characteristic failure classes
FAQ
- What typical errors does AI-generated code make?
- Five classes recur: hallucinated APIs and packages (calls to things that don't exist), silent edge-case errors (happy path works, boundaries fail), scope creep (changes beyond what was asked), self-confirming tests (the model tests its own assumptions), and plausible-but-wrong logic that reads correctly. Each class has a characteristic reason review misses it - and a specific check that catches it.
- What are hallucinated APIs and packages?
- Calls to functions, parameters, or entire packages that do not exist - the model completes a plausible pattern rather than consulting reality. Research measured open-source models hallucinating package names at roughly 21.7% and commercial models around 5.2%; attackers exploit this by pre-registering commonly hallucinated names (slopsquatting). Nonexistent APIs fail loudly at build time; nonexistent packages can install something malicious silently.
- Why does AI code look so convincing while being wrong?
- Because the model optimizes for plausibility, not truth: idiomatic style, consistent naming, tidy structure are exactly what it learned to produce. 61% of developers report AI code that looks correct but is not reliable - the polish that builds reviewer confidence carries no information about correctness. Human sloppiness used to be a signal; its absence now misleads.
- Are the failures the tool's fault or the workflow's?
- Both, in layers. The model contributes characteristic error classes; the workflow decides whether they reach production. The same model output is harmless in a pipeline with pre-written tests, boundary checks, and a spec comparison - and dangerous in a merge-on-green-checkmark pipeline. That is why the fix is procedural, not just better models.
- Do newer models fix these failure modes?
- Rates improve, classes persist. Commercial models hallucinate packages far less than open models, and each generation reduces some error types - but the structural causes (pattern completion without ground truth, self-testing, unstated requirements filled with invented policy) are properties of the setup, not the model version. Plan for the classes, celebrate the rate improvements.
- What single practice catches the most of these failures?
- A written task with boundaries and acceptance criteria, checked after the run. It converts scope creep from invisible to a boundary violation, unstated edge cases from model-invented policy into explicit criteria, and 'looks right' into criterion-by-criterion yes/no. Validation the model did not author closes the self-confirmation loop.
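A criterion-by-criterion check of this kind can be sketched as follows. The `Criterion` type, the `chunk` function standing in for the run's output, and the criteria themselves are all hypothetical; the point is only that each written criterion yields its own yes or no.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    description: str
    check: Callable[[], bool]


def evaluate(criteria: List[Criterion]) -> Dict[str, bool]:
    """Run every criterion; a check that crashes counts as not met."""
    report: Dict[str, bool] = {}
    for criterion in criteria:
        try:
            report[criterion.description] = bool(criterion.check())
        except Exception:
            report[criterion.description] = False
    return report


def chunk(items: list, size: int) -> list:
    """Hypothetical output of the run: fine on the happy path."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def raises_value_error(call: Callable[[], object]) -> bool:
    try:
        call()
    except ValueError:
        return True
    return False


# Criteria written before the run, including the unhappy paths.
CRITERIA = [
    Criterion("splits evenly divisible input", lambda: chunk([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]),
    Criterion("keeps a short final chunk", lambda: chunk([1, 2, 3], 2) == [[1, 2], [3]]),
    Criterion("empty input yields an empty list", lambda: chunk([], 2) == []),
    Criterion("negative size is rejected with ValueError",
              lambda: raises_value_error(lambda: chunk([1], -1))),
]
```

In this sketch `evaluate(CRITERIA)` marks the first three criteria as met and the last as not met, because `chunk([1], -1)` quietly returns an empty list instead of raising.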
Sources
- Sonar – State of Code: 61% report AI code that looks correct but isn't reliable (2026)
- Veracode data (45% of AI PRs with an OWASP Top-10 issue) – cited in metacto's review standards guide (2026)
- Package hallucination rates (~21.7% open-source, ~5.2% commercial models) – research summarized in metacto (2026)
- Sohail Saifi – AI code has a testing problem: it verifies its own assumptions (2026)
- GitClear – AI Copilot Code Quality: duplication up 8x, refactoring collapsing (2025)