Security Vulnerabilities in AI Code

Last updated: 2026-07-02 · 4 min read

AI-generated code fails security tests at measured, stable rates: across 100+ LLMs, 45% of samples introduced OWASP Top 10 vulnerabilities, cross-site scripting failed in 86% of relevant cases, Java samples failed at 72% – and the rates stayed flat across model generations. The conclusion is uncomfortable and useful: the fix lives in the process around generation, not in waiting for better models.

The measured baseline

Veracode’s 2025 GenAI Code Security Report is the largest public measurement of the question: over 100 LLMs, four languages, tasks with known-secure solutions. The headline – 45% of samples introduced OWASP Top 10 vulnerabilities – matters less than two structural findings. First, the per-language spread is wide: Java 72%, C# 45%, JavaScript 43%, Python 38%. Second, across model generations, functional quality improved while security performance stayed flat – newer and larger models bought no additional safety. Whatever your model roadmap, the security work stays yours.
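To make the per-language spread concrete, the sketch below weights the four reported rates by a team's language mix. The mix (`example_mix`) and the function name are hypothetical, and the calculation assumes the benchmark rates transfer to your code unchanged, which they will not exactly; treat the result as a rough prioritization signal, not a measurement.

```python
# Per-language security failure rates (percent) as reported in the text above.
FAIL_RATE_PCT = {"Java": 72, "C#": 45, "JavaScript": 43, "Python": 38}


def weighted_fail_rate(share_pct: dict) -> float:
    """Weight the reported rates by the share (in percent) of each language.

    Assumption: the benchmark rates apply as-is to your codebase. They are
    a starting point for prioritization, not a prediction.
    """
    if sum(share_pct.values()) != 100:
        raise ValueError("language shares must sum to 100")
    return sum(FAIL_RATE_PCT[lang] * share for lang, share in share_pct.items()) / 100


# Hypothetical stack: half Java, the rest Python and JavaScript.
example_mix = {"Java": 50, "Python": 30, "JavaScript": 20}
print(weighted_fail_rate(example_mix))
```

A Java-heavy mix pulls the weighted figure well above the 45% headline, which is the point of weighting hardening effort by stack.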

The recurring classes, and what catches each

| Class | Why AI produces it | What catches it |
| --- | --- | --- |
| Cross-site scripting (86% fail rate) | Encoding depends on output context the model cannot see | SAST rules + security criteria in the task |
| Injection (SQL, command, log) | String concatenation is the statistically common pattern in training data | SAST - the classic, reproducible catch |
| Hardcoded credentials | Examples in training data hardcode; the model imitates | Secret scanning pre-commit; secrets out of repos entirely |
| Weak cryptography | Outdated algorithms are overrepresented in old public code | SAST + dependency policies pinning approved primitives |
| Missing authentication/authorization checks | Trust boundaries live in architecture, not in the prompt | Human review of trust boundaries; criteria per task |

The recurring vulnerability classes in AI-generated code, why generation produces them, and the layer that catches each - no single layer covers the spread (sources: Veracode 2025; defense mapping ours).
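The injection row is the easiest to show in code. The sketch below contrasts the string-concatenation pattern with a parameterized query, using Python's standard `sqlite3` module; the table, the function names, and the payload are made up for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two illustrative users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "admin"), ("bob", "user")],
    )
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: the input becomes part of the SQL text itself.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the driver passes the input as data, never as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "nobody' OR '1'='1"
conn = make_db()
print(len(find_user_unsafe(conn, payload)))  # every row leaks
print(find_user_safe(conn, payload))         # no user has that literal name
```

Both functions "work" for ordinary names, which is why the unsafe variant survives a does-it-run check and needs a pattern-based catch.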

The mapping explains the flat-across-generations finding: the worst classes are the ones where security depends on context outside the generation window – where output lands, what the trust boundary is, what the organization considers approved crypto. More parameters do not supply missing context. The general failure taxonomy behind this is in why AI code fails.
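The "where output lands" point can be illustrated with one untrusted string and three destinations. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and a real application would normally rely on its template engine's context-aware escaping rather than hand-rolled helpers.

```python
import html
import json
from urllib.parse import quote

untrusted = '<script>alert("x")</script>'


def for_html_body(value: str) -> str:
    # HTML text or attribute context: escape markup characters and quotes.
    return html.escape(value, quote=True)


def for_url_component(value: str) -> str:
    # URL query or path component: percent-encode everything reserved.
    return quote(value, safe="")


def for_inline_script(value: str) -> str:
    # Inside a <script> block: emit a JSON string literal, and escape "<"
    # so the value cannot contain a literal closing script tag.
    return json.dumps(value).replace("<", "\\u003c")


print(for_html_body(untrusted))
print(for_url_component(untrusted))
print(for_inline_script(untrusted))
```

The correct encoder is determined entirely by the destination, which is information a prompt often does not carry.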

The defense stack, deterministic-first

  1. SAST and secret scanning per change. The pattern-shaped classes fall to deterministic tools, reproducibly and cheaply – the division of labor with static analysis is detailed in SonarQube vs. verification.
  2. Security criteria in the task. “Validates input per X, encodes output for Y, touches no auth paths” – written before the run, checkable after it. This is what supplies the context the model lacked.
  3. Human review at the trust boundaries. Relieved of the mechanical layer, review judges what no scanner can: whether the boundary itself is right.
  4. The BSI/ANSSI baseline under all three: treat generated output as unverified input. With only 48% verifying consistently, the 45% defect supply meets a gate that stands open 52% of the time.
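The first layer of the stack above can be sketched as a deterministic check over the lines a change adds. The two regex rules below are deliberately crude, hypothetical stand-ins for what a real SAST tool or secret scanner does; they are not the rule set of any actual product.

```python
import re

# Hypothetical, deliberately simple rules. Real scanners ship far larger
# and more precise rule sets; these only illustrate the mechanism.
RULES = {
    "hardcoded-credential": re.compile(
        r"(?i)(password|passwd|secret|api_key|token)\w*\s*=\s*[\"'][^\"']+[\"']"
    ),
    "weak-hash": re.compile(r"\bhashlib\.(md5|sha1)\s*\("),
}


def scan_change(added_lines: list) -> list:
    """Return (line_number, rule_id) for every rule hit in the added lines."""
    findings = []
    for number, line in enumerate(added_lines, start=1):
        for rule_id, pattern in RULES.items():
            if pattern.search(line):
                findings.append((number, rule_id))
    return findings


def gate(added_lines: list) -> bool:
    """Deterministic gate: the change passes only with zero findings."""
    return not scan_change(added_lines)


change = [
    'API_KEY = "sk-live-123"',                 # literal secret: flagged
    "digest = hashlib.md5(data).hexdigest()",  # outdated hash: flagged
    'password = os.environ["DB_PASSWORD"]',    # read from environment: clean
]
print(scan_change(change))
print(gate(change))
```

The same input always yields the same findings, which is what makes this layer cheap to run on every change and leaves review time for the trust boundaries.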

Where Reality Graph fits

Reality Graph is not a security scanner and does not replace SAST. Its contribution to this stack is the second layer: security criteria written into the task, the change verified against them per run, and the outcome – including skipped checks – recorded in an evidence report. Scanners answer “does this code contain known-bad patterns?”; verification answers “did this change respect the security requirements we actually set?” – both, per change, is the posture the numbers argue for.
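As an illustration of the idea only, and not of Reality Graph's actual API or report format, the sketch below writes criteria as data, checks a change against them, and records skipped checks explicitly rather than dropping them. Every name here (`Criterion`, `verify`, the change dictionary) is hypothetical, and the string-matching checks are placeholders for real verification.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Criterion:
    text: str
    # None means no automated check exists yet; that fact is recorded too.
    check: Optional[Callable[[dict], bool]] = None


def verify(change: dict, criteria: list) -> list:
    """Evaluate each criterion and keep skipped ones visible in the report."""
    report = []
    for criterion in criteria:
        if criterion.check is None:
            status = "skipped"
        else:
            status = "pass" if criterion.check(change) else "fail"
        report.append({"criterion": criterion.text, "status": status})
    return report


criteria = [
    Criterion(
        "touches no auth paths",
        lambda ch: not any(path.startswith("auth/") for path in ch["files"]),
    ),
    Criterion(
        "encodes output for HTML",
        lambda ch: "html.escape(" in ch["diff"],  # placeholder heuristic
    ),
    Criterion("validates input per the API schema"),  # no check written yet
]

change = {"files": ["views/profile.py"], "diff": "return html.escape(user.bio)"}
for entry in verify(change, criteria):
    print(entry)
```

Recording "skipped" as its own status keeps an unchecked requirement from reading as a passed one.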

This analysis gives you

  • The measured class-by-class picture, with study design named
  • The flat-across-generations finding and its consequence
  • A defense stack ordered deterministic-first
  • Language-level risk weighting for your stack

It does not give you

  • A replacement for SAST - Reality Graph is not a scanner
  • An argument against AI on sensitive code - against unverified AI
  • Exact rates for your codebase - measure with your own pipeline
  • A model recommendation - security did not vary usefully by model

FAQ

Which security vulnerabilities does AI code produce most often?
The best-measured picture comes from Veracode's 2025 evaluation of over 100 LLMs: 45% of generated samples introduced OWASP Top 10 vulnerabilities overall, with cross-site scripting the standout - models failed to prevent it in 86% of relevant samples. Injection classes, insecure handling of credentials and weak cryptographic choices round out the recurring set. The pattern is consistent: the classes where security depends on context the model does not see fail worst.
Are newer, larger models more secure?
Measurably no - this is the report's most uncomfortable finding. While functional correctness improved across model generations, security performance stayed flat regardless of model size or training sophistication. The practical consequence: waiting for the next model generation is not a security strategy, and the fix has to live in the process around generation - checks, criteria, gates - rather than in model selection.
Why does AI keep producing these specific vulnerabilities?
Three compounding reasons. Training data: models learned from decades of public code, much of it insecure - the insecure pattern is often the statistically common one. Missing context: whether output encoding is needed depends on where data ends up, which the model often cannot see from the prompt. And objective mismatch: generation optimizes for code that works, and insecure code usually works - the vulnerability is invisible to the 'does it run' test that implicitly guides generation.
Does the programming language matter?
Substantially, per the same evaluation: Java samples failed security tests at 72%, the worst of the four languages tested, against 38% for Python, 43% for JavaScript and 45% for C#. The exact ranking will shift with tasks and models, but the spread itself is the finding - your risk profile depends on your stack, and a team's hardening effort should be weighted accordingly.
What actually catches these vulnerability classes?
A layered answer, deterministic-first: SAST and secret scanning catch the pattern-shaped classes (injection, hardcoded credentials, weak crypto) reproducibly and cheaply - this is where tools like SonarQube earn their place. Security acceptance criteria in the task catch the context-dependent classes the scanner cannot judge. And human review, relieved of the mechanical layer, judges the trust boundaries. No single layer covers the spread; the stack does.
Should we stop using AI for security-sensitive code?
The data supports a narrower conclusion: never merge AI-generated security-sensitive code on generation trust alone. Teams that write security requirements into the task (input validation, encoding, authentication paths as explicit criteria), run SAST per change and gate on human review use AI on sensitive code with measured rather than assumed risk. The 45% figure describes unverified output - it is an argument for the pipeline, not against the tool.
