Concept

Why AI-Generated Code Fails

Last updated: 2026-07-02 · 3 min read

AI-generated code fails in characteristic classes rather than random bugs: hallucinated APIs and packages, silent edge-case errors, scope creep, self-confirming tests, and plausible-but-wrong logic. The common root: a model completes plausible patterns without ground truth - and its polish disarms exactly the review that should catch it.

The five failure classes

Class	What it looks like	Why review misses it	What catches it
Hallucinated APIs & packages	Calls to functions or packages that don't exist (~21.7% package hallucination in open models, ~5.2% commercial)	The names look exactly like real ones	Build/type checks; dependency allow-lists
Silent edge-case errors	Happy path works; empty inputs, duplicates, timeouts misbehave	Reviewers walk the happy path too	Unhappy paths as explicit acceptance criteria
Scope creep	Changes beyond the task - 'improvements' nobody asked for	Off-scope edits look like diligence	Declared boundaries, checked after the run
Self-confirming tests	Model-written tests assert the model's own assumptions	Green suite reads as verification	Tests written before the run; validation the model didn't author
Plausible-but-wrong logic	Reads idiomatically, solves a subtly different problem	Polish builds unearned confidence (61% of developers report code that looks correct but is not reliable)	Spec-vs-implementation check against written intent

The five characteristic failure classes of AI-generated code: what each looks like, why review misses it, and which check catches it.
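The dependency allow-list named in the first table row can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a complete requirements parser: the allow-list contents, the function names, and the sample package names are all hypothetical, and only simple `name==version` style lines are handled.

```python
import re
from typing import List, Optional, Set

# Hypothetical allow-list of vetted packages (names already normalized).
ALLOWED_PACKAGES: Set[str] = {"requests", "numpy", "pydantic"}


def requirement_name(line: str) -> Optional[str]:
    """Return the normalized package name from one requirements-style line."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if not line or line.startswith("-"):  # skip blanks and options like "-r base.txt"
        return None
    match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
    if match is None:
        return None
    # Normalize as PEP 503 does: lowercase, runs of "-", "_", "." become one "-".
    return re.sub(r"[-_.]+", "-", match.group(0)).lower()


def unapproved_packages(requirements_text: str, allowed: Set[str]) -> List[str]:
    """List every declared package that is not on the allow-list."""
    names = (requirement_name(line) for line in requirements_text.splitlines())
    return sorted({n for n in names if n is not None and n not in allowed})
```

Run against a requirements file before install, a non-empty result stops the pipeline. A hallucinated or squatted name then fails as "not approved" before anything is installed, rather than installing silently.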

The common root: plausibility without ground truth

None of these classes is a random defect - they are all the same mechanism viewed from different angles. A language model completes the most plausible continuation of a pattern. Usually the plausible continuation is also correct; the failure classes are the places where plausibility and truth systematically diverge: an API that should exist by analogy, a policy for empty input the requirements never stated, a test that enshrines the implementation’s own assumption.

The polish is part of the problem. 61% of developers report AI code that looks correct but is not reliable - and security data puts numbers on the gap: in Veracode’s data, 45% of AI-authored PRs introduce at least one OWASP Top 10 issue. With a human author, sloppy style correlated with sloppy logic, and reviewers calibrated on that signal. AI output removed the signal without removing the errors.

The special case: self-confirming tests

The most treacherous class deserves its own paragraph, because it defeats the standard remedy. “Let the AI write tests” sounds like verification - but tests written by the generating model verify the model’s assumptions, not your requirements. If the implementation misread the intent, the tests enshrine the misreading, and the suite goes green on a wrong program. This is the circularity problem that makes an external reference - a written specification - structurally necessary rather than nice to have.
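The circularity is easy to reproduce in miniature. The sketch below is an invented example, not taken from any real codebase: it assumes a written requirement of "orders of 100 or more qualify for the discount", and a hypothetical generated implementation that reads it as "more than 100". All function names are illustrative.

```python
def qualifies_for_discount(order_total: float) -> bool:
    # Hypothetical generated implementation. The assumed written intent is
    # "orders of 100 or more qualify"; the code implements "more than 100".
    return order_total > 100


def model_authored_checks() -> bool:
    # Checks derived from the implementation share its reading of the
    # requirement, so they never probe the boundary the misreading affects.
    return qualifies_for_discount(150) is True and qualifies_for_discount(50) is False


def spec_authored_checks() -> bool:
    # Checks derived from the written requirement include the boundary it names.
    return (
        qualifies_for_discount(150) is True
        and qualifies_for_discount(100) is True
        and qualifies_for_discount(99.99) is False
    )
```

Here `model_authored_checks()` passes and `spec_authored_checks()` fails on the same function. The only difference is where the expected values came from: the implementation or the written requirement.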

What this means for your pipeline

Per class, one gate. Build and type checks for hallucinations, explicit unhappy-path criteria for silent errors, declared boundaries for scope creep, pre-written tests for self-confirmation, and a spec-vs-implementation check for plausible-but-wrong logic.
Volume multiplies the classes. Each failure rate rides on top of the review bottleneck: more code, subtler failures, flat reading capacity.
Rates change, classes stay. Build the pipeline for the classes; enjoy every model generation that lowers the rates.
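As one illustration of a per-class gate, the declared-boundaries check for scope creep could look like the sketch below. It assumes the task declares its allowed paths as glob patterns and that the list of changed files comes from the version control system; the function name, the patterns, and the file paths are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List


def out_of_scope(changed_files: List[str], allowed_patterns: List[str]) -> List[str]:
    """Return the changed files that match none of the declared patterns."""
    return [
        path
        for path in changed_files
        # fnmatchcase compares case-sensitively on every platform;
        # its "*" also matches "/" so "src/billing/*" covers subdirectories.
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical task boundary and post-run diff.
allowed = ["src/billing/*", "tests/billing/*"]
changed = [
    "src/billing/invoice.py",
    "tests/billing/test_invoice.py",
    "src/auth/session.py",
]
violations = out_of_scope(changed, allowed)
```

With these inputs, `violations` contains only `src/auth/session.py`. The off-scope edit shows up as a boundary violation to be justified or reverted, instead of blending into the diff as apparent diligence.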

Where Reality Graph fits

Reality Graph is a per-run gate against exactly these classes: boundaries catch scope creep, acceptance criteria force the unhappy paths into the open, validation the model did not author breaks self-confirmation, and the evidence report records which checks actually ran.

Knowing the classes gives you

Targeted gates instead of generic 'review harder'
A reviewer briefing: where AI code characteristically lies
An explanation for green-suite production incidents
Sourced rates for the risk conversation

It does not mean

AI code is worse than human code in every dimension
Any class is guaranteed per run - these are rates
Better models make the pipeline unnecessary
Human code lacks its own characteristic failure classes


FAQ

What typical errors does AI-generated code make?: Five classes recur: hallucinated APIs and packages (calls to things that don't exist), silent edge-case errors (happy path works, boundaries fail), scope creep (changes beyond what was asked), self-confirming tests (the model tests its own assumptions), and plausible-but-wrong logic that reads correctly. Each class has a characteristic reason review misses it - and a specific check that catches it.
What are hallucinated APIs and packages?: Calls to functions, parameters, or entire packages that do not exist - the model completes a plausible pattern rather than consulting reality. Research measured open-source models hallucinating package names at roughly 21.7% and commercial models around 5.2%; attackers exploit this by pre-registering commonly hallucinated names (slopsquatting). Nonexistent APIs fail loudly at build time; nonexistent packages can install something malicious silently.
Why does AI code look so convincing while being wrong?: Because the model optimizes for plausibility, not truth: idiomatic style, consistent naming, tidy structure are exactly what it learned to produce. 61% of developers report AI code that looks correct but is not reliable - the polish that builds reviewer confidence carries no information about correctness. Human sloppiness used to be a signal; its absence now misleads.
Are the failures the tool's fault or the workflow's?: Both, in layers. The model contributes characteristic error classes; the workflow decides whether they reach production. The same model output is harmless in a pipeline with pre-written tests, boundary checks, and a spec comparison - and dangerous in a merge-on-green-checkmark pipeline. That is why the fix is procedural, not just better models.
Do newer models fix these failure modes?: Rates improve, classes persist. Commercial models hallucinate packages far less than open models, and each generation reduces some error types - but the structural causes (pattern completion without ground truth, self-testing, unstated requirements filled with invented policy) are properties of the setup, not the model version. Plan for the classes, celebrate the rate improvements.
What single practice catches the most of these failures?: A written task with boundaries and acceptance criteria, checked after the run. It converts scope creep from invisible to a boundary violation, unstated edge cases from model-invented policy into explicit criteria, and 'looks right' into criterion-by-criterion yes/no. Validation the model did not author closes the self-confirmation loop.
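The criterion-by-criterion yes/no from the last answer can be sketched as a small harness. This is an illustration under assumptions, not a prescribed format: the `mean` function stands in for a hypothetical implementation under review, and the two criteria are invented examples of a happy-path criterion and an explicitly stated unhappy-path criterion.

```python
from typing import Callable, List, Tuple


def mean(values: List[float]) -> float:
    # Hypothetical implementation under review: correct on the happy path,
    # but empty input falls through to a ZeroDivisionError.
    return sum(values) / len(values)


def raises(exc_type: type, fn: Callable, *args) -> bool:
    """True only if fn(*args) raises exactly the expected kind of exception."""
    try:
        fn(*args)
    except exc_type:
        return True
    except Exception:
        return False
    return False


# Assumed acceptance criteria, written before the run (illustrative only).
CRITERIA: List[Tuple[str, Callable[[], bool]]] = [
    ("mean of [2, 4] is 3", lambda: mean([2, 4]) == 3),
    ("empty input raises ValueError", lambda: raises(ValueError, mean, [])),
]


def evaluate(criteria: List[Tuple[str, Callable[[], bool]]]) -> List[Tuple[str, bool]]:
    """Evaluate each criterion to a yes/no; a crashing check counts as no."""
    report = []
    for description, check in criteria:
        try:
            passed = bool(check())
        except Exception:
            passed = False
        report.append((description, passed))
    return report
```

Calling `evaluate(CRITERIA)` marks the first criterion as met and the second as not met, because the implementation raises `ZeroDivisionError` rather than the `ValueError` the criterion states. The empty-input policy becomes a recorded decision instead of something the model filled in.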
