Why AI-Generated Code Fails

Last updated: 2026-07-02 · 3 min read

AI-generated code fails in characteristic classes rather than through random bugs: hallucinated APIs and packages, silent edge-case errors, scope creep, self-confirming tests, and plausible-but-wrong logic. The common root: a model completes plausible patterns without ground truth - and its polish disarms exactly the review that should catch it.

The five failure classes

| Class | What it looks like | Why review misses it | What catches it |
| --- | --- | --- | --- |
| Hallucinated APIs & packages | Calls to functions or packages that don't exist (~21.7% package hallucination in open models, ~5.2% commercial) | The names look exactly like real ones | Build/type checks; dependency allow-lists |
| Silent edge-case errors | Happy path works; empty inputs, duplicates, timeouts misbehave | Reviewers walk the happy path too | Unhappy paths as explicit acceptance criteria |
| Scope creep | Changes beyond the task - 'improvements' nobody asked for | Off-scope edits look like diligence | Declared boundaries, checked after the run |
| Self-confirming tests | Model-written tests assert the model's own assumptions | Green suite reads as verification | Tests written before the run; validation the model didn't author |
| Plausible-but-wrong logic | Reads idiomatically, solves a subtly different problem | Polish builds unearned confidence (61% know this failure) | Spec-vs-implementation check against written intent |
The five characteristic failure classes of AI-generated code: what each looks like, why review misses it, and which check catches it.
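
As a rough illustration of the "dependency allow-lists" check in the first row, the sketch below scans generated Python source for imports and reports any top-level module that is not on an approved list. The allow-list contents, the function name `unapproved_imports`, and the misspelled package `requestz` are all hypothetical; a real gate would more likely work from the project's lockfile or manifest than from a hard-coded set.

```python
import ast

# Hypothetical allow-list: top-level modules this project has approved.
ALLOWED_MODULES = {"json", "pathlib", "requests"}


def unapproved_imports(source: str, allowed: set[str]) -> set[str]:
    """Return top-level module names imported by `source` that are not in `allowed`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b.c` -> record "a"
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute `from a.b import c` -> record "a"; relative imports are skipped.
            found.add(node.module.split(".")[0])
    return found - allowed


# Hypothetical model output containing one plausible-looking but unapproved name.
generated = "import json\nimport requestz\nfrom pathlib import Path\n"
print(sorted(unapproved_imports(generated, ALLOWED_MODULES)))  # ['requestz']
```

The check only parses the source and never installs or imports anything, so a hallucinated name is flagged before a package manager has the chance to resolve it.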

The common root: plausibility without ground truth

None of these classes is a random defect - they are all the same mechanism viewed from different angles. A language model completes the most plausible continuation of a pattern. Usually the plausible continuation is also correct; the failure classes are the places where plausibility and truth systematically diverge: an API that should exist by analogy, a policy for empty input the requirements never stated, a test that enshrines the implementation’s own assumption.

The polish is part of the problem. 61% of developers report AI code that looks correct but is not reliable - and security data puts numbers on the gap: 45% of AI-authored PRs introduce at least one OWASP-Top-10 issue in Veracode’s data. With a human author, sloppy style correlated with sloppy logic and reviewers calibrated on that signal. AI output removed the signal without removing the errors.

The special case: self-confirming tests

The most treacherous class deserves its own paragraph, because it defeats the standard remedy. “Let the AI write tests” sounds like verification - but tests written by the generating model verify the model’s assumptions, not your requirements. If the implementation misread the intent, the tests enshrine the misreading, and the suite goes green on a wrong program. This is the circularity problem that makes an external reference - a written specification - structurally necessary rather than nice to have.
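
The circularity can be made concrete with a small, invented example. Assume a hypothetical requirement that email addresses differing only in letter case count as duplicates. The implementation below misreads that as "exact duplicates only"; the test written alongside it shares the misreading and passes, while a test derived from the written requirement fails. All function names and data are made up for illustration.

```python
def dedupe_emails(emails):
    """Stand-in for model-written code: removes exact duplicates, keeps first occurrence."""
    seen = set()
    result = []
    for email in emails:
        if email not in seen:
            seen.add(email)
            result.append(email)
    return result


def model_written_test():
    # Encodes the implementation's own assumption: only exact matches are duplicates.
    return dedupe_emails(["a@x.org", "a@x.org", "b@x.org"]) == ["a@x.org", "b@x.org"]


def spec_derived_test():
    # Written from the (hypothetical) requirement before the run:
    # addresses that differ only in case are duplicates.
    return dedupe_emails(["a@x.org", "A@X.ORG", "b@x.org"]) == ["a@x.org", "b@x.org"]


print("model-written test passes:", model_written_test())  # True
print("spec-derived test passes:", spec_derived_test())    # False
```

Both tests look equally reasonable in review; only the one anchored to the written requirement can disagree with the implementation.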

What this means for your pipeline

  • Per class, one gate. Build and type checks for hallucinations, explicit unhappy-path criteria for silent errors, declared boundaries for scope creep, pre-written tests for self-confirmation, and a spec-vs-implementation check for plausible-but-wrong logic.
  • Volume multiplies the classes. Each failure rate rides on top of the review bottleneck: more code, subtler failures, flat reading capacity.
  • Rates change, classes stay. Build the pipeline for the classes; enjoy every model generation that lowers the rates.
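
One of the gates listed above, declared boundaries for scope creep, can be sketched in a few lines. The example assumes the task declares its boundary as glob patterns and that the list of changed paths is available after the run (for instance from `git diff --name-only`); the patterns, paths, and the function name `out_of_scope` are hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files, allowed_patterns):
    """Return changed paths that match none of the declared boundary patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical task boundary and hypothetical list of files the run touched.
allowed = ["src/billing/*", "tests/billing/*"]
changed = [
    "src/billing/invoice.py",
    "tests/billing/test_invoice.py",
    "src/auth/session.py",
]
print(out_of_scope(changed, allowed))  # ['src/auth/session.py']
```

Note that `*` in `fnmatchcase` also matches path separators, so `src/billing/*` covers nested directories as well. A non-empty result turns an "improvement nobody asked for" into an explicit boundary violation that someone has to accept or reject.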

Where Reality Graph fits

Reality Graph is a per-run gate against exactly these classes: boundaries catch scope creep, acceptance criteria force the unhappy paths into the open, validation the model did not author breaks self-confirmation, and the evidence report records which checks actually ran.

Knowing the classes gives you

  • Targeted gates instead of generic 'review harder'
  • A reviewer briefing: where AI code characteristically lies
  • An explanation for green-suite production incidents
  • Sourced rates for the risk conversation

It does not mean

  • AI code is worse than human code in every dimension
  • Any class is guaranteed per run - these are rates
  • Better models make the pipeline unnecessary
  • Human code lacks its own characteristic failure classes

FAQ

What typical errors does AI-generated code make?
Five classes recur: hallucinated APIs and packages (calls to things that don't exist), silent edge-case errors (happy path works, boundaries fail), scope creep (changes beyond what was asked), self-confirming tests (the model tests its own assumptions), and plausible-but-wrong logic that reads correctly. Each class has a characteristic reason review misses it - and a specific check that catches it.
What are hallucinated APIs and packages?
Calls to functions, parameters, or entire packages that do not exist - the model completes a plausible pattern rather than consulting reality. Research measured open-source models hallucinating package names at roughly 21.7% and commercial models around 5.2%; attackers exploit this by pre-registering commonly hallucinated names (slopsquatting). Nonexistent APIs fail loudly at build time; nonexistent packages can install something malicious silently.
Why does AI code look so convincing while being wrong?
Because the model optimizes for plausibility, not truth: idiomatic style, consistent naming, tidy structure are exactly what it learned to produce. 61% of developers report AI code that looks correct but is not reliable - the polish that builds reviewer confidence carries no information about correctness. Human sloppiness used to be a signal; its absence now misleads.
Are the failures the tool's fault or the workflow's?
Both, in layers. The model contributes characteristic error classes; the workflow decides whether they reach production. The same model output is harmless in a pipeline with pre-written tests, boundary checks, and a spec comparison - and dangerous in a merge-on-green-checkmark pipeline. That is why the fix is procedural, not just better models.
Do newer models fix these failure modes?
Rates improve, classes persist. Commercial models hallucinate packages far less than open models, and each generation reduces some error types - but the structural causes (pattern completion without ground truth, self-testing, unstated requirements filled with invented policy) are properties of the setup, not the model version. Plan for the classes, celebrate the rate improvements.
What single practice catches the most of these failures?
A written task with boundaries and acceptance criteria, checked after the run. It converts scope creep from invisible to a boundary violation, unstated edge cases from model-invented policy into explicit criteria, and 'looks right' into criterion-by-criterion yes/no. Validation the model did not author closes the self-confirmation loop.
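
A minimal sketch of what "criterion-by-criterion yes/no" could look like, assuming each acceptance criterion is paired with an executable check. The `Criterion` type, the `average` function standing in for generated code, and the policy that empty input must raise `ValueError` are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    description: str
    check: Callable[[], bool]


def average(values):
    """Stand-in for model-generated code that only handles the happy path."""
    return sum(values) / len(values)


def raises_value_error(fn, *args):
    """True only if calling fn(*args) raises ValueError."""
    try:
        fn(*args)
    except ValueError:
        return True
    except Exception:
        return False
    return False


# Hypothetical acceptance criteria, written before the run; the second is an unhappy path.
criteria = [
    Criterion("average of [2, 4] is 3", lambda: average([2, 4]) == 3),
    Criterion("empty input raises ValueError", lambda: raises_value_error(average, [])),
]


def evaluate(criteria):
    """Return (description, passed) for every criterion."""
    return [(c.description, c.check()) for c in criteria]


for description, passed in evaluate(criteria):
    print("yes" if passed else "no ", description)
```

Here the happy-path criterion passes and the empty-input criterion fails, because the stand-in code raises `ZeroDivisionError` rather than the stated `ValueError`. The unstated policy becomes a visible "no" instead of a detail nobody reviewed.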
