🐛 AI Failure Modes · Intermediate · 12 min read

The Five Ways AI-Generated Code Goes Wrong

Confident, plausible-looking output that passes review until production exposes the assumption underneath. Five named modes and the detection move for each.

By Kishan Vaghani · Reviewed by Kajal Pansuriya · Published May 26, 2026

The most expensive AI bugs are not syntax errors. The compiler catches those. The expensive ones are the confident, plausible-looking outputs that pass code review, pass the tests the model wrote alongside the code, ship to staging, and finally break in production when a real-world value exposes the assumption the model quietly made. There are five recurring shapes worth naming.

1. Hallucinated APIs

The model calls a function that does not exist on the library you imported. The function name looks reasonable. The signature looks reasonable. The model's training data contained a similar method on a similar library, and the model generalised.

A real shape: lodash.deepMerge sounds like it should exist next to lodash.merge — and the model will happily generate code that calls it. The actual lodash export is merge (which is already deep) and there is no separate deepMerge. Another common one: lodash.deepFreeze looks plausible alongside Object.freeze but lodash does not export it.

Detection move: for every unfamiliar method call in AI-generated code, search the library's actual exports — the published types file or the package's documentation site — before accepting the diff. The red flag is any function name that you have not seen used before in this codebase or its docs.
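
Part of this check can be automated by asking the imported module itself. The sketch below uses a hypothetical helper, findMissingExports (not part of any library), and demonstrates it on the built-in Object so it runs without installing anything; the same call would apply to an imported library object.

```javascript
// Hypothetical helper (not part of any library): given a module object and the
// method names an AI-generated diff calls on it, return the names that are not
// actually functions on that object.
function findMissingExports(moduleObject, calledNames) {
  return calledNames.filter((name) => typeof moduleObject[name] !== "function");
}

// Demonstrated on the built-in Object so the snippet is self-contained.
// Object.freeze exists; Object.deepFreeze does not.
const missing = findMissingExports(Object, ["freeze", "deepFreeze"]);
console.log(missing); // [ 'deepFreeze' ]
```

With lodash installed, passing the imported lodash object and ["merge", "deepMerge"] should report deepMerge, per the exports described above. A runtime check like this only covers what the installed version exposes; the types file or documentation remains the authority on signatures.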

2. Plausible-but-wrong security advice

The model will produce encryption code, CSP headers, password hashing setups, and JWT verification flows that read correctly to a casual review and are subtly wrong in ways that matter.

A real shape: AES-GCM encryption where the IV is constructed deterministically from the message identifier "for caching reasons." This destroys the security guarantee of GCM, which depends on the IV being unique for every encryption under a given key. The code compiles, the unit tests pass, and the system is compromised the first time two different messages are encrypted under the same key with the same identifier.
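
For contrast, a minimal sketch of the safer shape with WebCrypto: a fresh random IV for every call, stored next to the ciphertext. The names freshIv and encryptMessage are illustrative assumptions, and key management is out of scope here.

```javascript
// WebCrypto is a global in browsers and recent Node releases; the fallback
// covers older Node versions where it lives under node:crypto.
const webCrypto = globalThis.crypto ?? require("node:crypto").webcrypto;

// Draw a fresh random 96-bit IV for every encryption. Never derive it from
// message content or identifiers.
function freshIv() {
  return webCrypto.getRandomValues(new Uint8Array(12));
}

// Hypothetical wrapper: encrypts one message and returns the IV alongside the
// ciphertext, because the receiver needs the IV (it is not secret) to decrypt.
async function encryptMessage(key, plaintext) {
  const iv = freshIv();
  const ciphertext = await webCrypto.subtle.encrypt(
    { name: "AES-GCM", iv },
    key,
    new TextEncoder().encode(plaintext)
  );
  return { iv, ciphertext: new Uint8Array(ciphertext) };
}
```

Here key is assumed to be an AES-GCM CryptoKey, for example one produced by subtle.generateKey. Random IVs are the common choice; systems that encrypt very large numbers of messages under one key usually add key rotation or a counter-based IV, which is worth confirming against the canonical documentation rather than this sketch.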

Another: a Content-Security-Policy header whose script-src lists allowed hosts and also includes 'unsafe-inline'. Any injected inline script still runs, so the host allowlist does little against XSS. The header looks stricter than no header at all. Against inline injection, it is not.
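
A rough way to catch this in review is to parse the header and look at what script-src actually permits. The helper below, scriptSrcAllowsInline, is a hypothetical sketch rather than a full CSP parser; it assumes a single header string and only considers script-src, default-src, nonces, and hashes.

```javascript
// Hypothetical helper: returns true when the policy still lets inline scripts run.
function scriptSrcAllowsInline(cspHeader) {
  const directives = cspHeader.split(";").map((d) => d.trim().split(/\s+/));
  // script-src falls back to default-src when it is absent.
  const scriptSrc =
    directives.find(([name]) => name.toLowerCase() === "script-src") ??
    directives.find(([name]) => name.toLowerCase() === "default-src");
  if (!scriptSrc) return true; // no directive restricts scripts at all
  // Lowercased for keyword detection only; nonce and hash values are not compared.
  const sources = scriptSrc.slice(1).map((s) => s.toLowerCase());
  // Browsers that support nonces or hashes ignore 'unsafe-inline' when one is present.
  const hasNonceOrHash = sources.some((s) => /^'(nonce-|sha(256|384|512)-)/.test(s));
  return sources.includes("'unsafe-inline'") && !hasNonceOrHash;
}

console.log(scriptSrcAllowsInline("script-src 'self' https://cdn.example.com 'unsafe-inline'")); // true
console.log(scriptSrcAllowsInline("script-src 'self' https://cdn.example.com")); // false
```

The host https://cdn.example.com is a placeholder. A check like this is a review aid, not a substitute for reading the CSP documentation for the directives you depend on.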

Detection move: never accept AI-generated security code without reading the canonical documentation for the primitive in question — the WebCrypto MDN page for encryption, the OWASP cheat sheet for the relevant attack class, the RFC for the protocol. The model's confidence in this category is systematically miscalibrated.

3. Tests that pass because the model wrote them around its own bug

You ask the model for a function and its tests in the same completion. The tests pass. The code ships. Production breaks. What happened is that the model wrote tests that exercise the paths the model thought about, with the inputs the model imagined, and skipped the cases the model's implementation handles incorrectly.

A real shape: a date-parsing function that does the wrong thing on Feb 29 of a leap year, paired with a test suite that uses March 15 as its only "weird" date. Both halves of the completion share the same blind spot.

Detection move: ask for tests in a separate prompt that names edge cases explicitly, or write the tests yourself and let the model implement against them (see prompting patterns). Tests that come from the same completion as the code under test are not independent evidence.
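
As an illustration of tests that name edge cases explicitly, the sketch below pairs a hypothetical parseIsoDate function with a case table written independently of it. The function, its "YYYY-MM-DD" input format, and the chosen cases are assumptions for the example, not code from a real project.

```javascript
// Hypothetical function under test: parses "YYYY-MM-DD" and returns a UTC Date,
// or null when the text is not a real calendar date.
function parseIsoDate(text) {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(text);
  if (!match) return null;
  const [year, month, day] = match.slice(1).map(Number);
  const date = new Date(Date.UTC(year, month - 1, day));
  // Date.UTC silently rolls invalid days forward (Feb 30 becomes Mar 1),
  // so reject anything that does not round-trip.
  const roundTrips =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day;
  return roundTrips ? date : null;
}

// Edge cases named explicitly: [input, should it be a valid date?]
const cases = [
  ["2024-02-29", true],  // leap year
  ["2023-02-29", false], // not a leap year
  ["1900-02-29", false], // century year, not divisible by 400
  ["2000-02-29", true],  // century year, divisible by 400
  ["2024-04-31", false], // April has 30 days
  ["2024-12-31", true],  // last day of the year
];

const failures = cases.filter(
  ([input, expectValid]) => (parseIsoDate(input) !== null) !== expectValid
);
console.log(failures.length === 0 ? "all edge cases pass" : failures);
```

The value is in the case table, not the implementation: it lists the inputs a same-completion test suite tends to skip, so it can be handed to the model as a fixed target or run against whatever the model produced.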

4. Off-by-one errors in regex

Regex is dense — small character differences change the meaning significantly — and the model is very good at producing patterns that match almost the right set of strings. Almost.

Common shapes: + where * was wanted (matches one-or-more instead of zero-or-more, so empty inputs fail); [a-z] where [a-zA-Z] was wanted (case-sensitive identifiers); a missing \b word boundary that lets the pattern match inside larger words; a greedy .* where a lazy .*? would terminate at the first match.
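
Each of these shapes can be seen in a one-line comparison. The inputs below are made up for illustration.

```javascript
// + versus *: one-or-more rejects the empty string, zero-or-more accepts it.
console.log(/^\d+$/.test(""), /^\d*$/.test("")); // false true

// [a-z] versus [a-zA-Z]: the lowercase-only class rejects camelCase identifiers.
console.log(/^[a-z]+$/.test("userId"), /^[a-zA-Z]+$/.test("userId")); // false true

// Missing \b: without word boundaries the pattern matches inside a larger word.
console.log(/cat/.test("concatenate"), /\bcat\b/.test("concatenate")); // true false

// Greedy versus lazy: .* runs to the last closing tag, .*? stops at the first.
const html = "<b>one</b><b>two</b>";
const greedy = html.match(/<b>.*<\/b>/)[0];
const lazy = html.match(/<b>.*?<\/b>/)[0];
console.log(greedy); // <b>one</b><b>two</b>
console.log(lazy);   // <b>one</b>
```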

// "Almost right" email pattern from a model
const emailRe = /^[a-z0-9._]+@[a-z0-9.]+\.[a-z]+$/;

// Misses: uppercase local parts, plus-addressing (user+tag@host),
// and hyphens in the local part or domain (user@my-host.com).
// Production-shaped users break in week one.

Detection move: never accept AI-generated regex without writing five inputs the regex should match and five it should not, and running both sets. Regex is a category where reading the pattern is harder than testing it.
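
The five-and-five check can be sketched as a small harness. checkRegex is a hypothetical helper, and the sample addresses are invented; the pattern is the "almost right" email regex shown earlier in this section.

```javascript
// Hypothetical helper: run a pattern against inputs it should and should not
// match, and report every disagreement.
function checkRegex(pattern, shouldMatch, shouldNotMatch) {
  return {
    missed: shouldMatch.filter((input) => !pattern.test(input)),
    falsePositives: shouldNotMatch.filter((input) => pattern.test(input)),
  };
}

const emailPattern = /^[a-z0-9._]+@[a-z0-9.]+\.[a-z]+$/;

const report = checkRegex(
  emailPattern,
  // Five inputs that should match.
  ["alice@example.com", "bob.smith@mail.example.org", "Alice@Example.com",
   "user+tag@host.io", "carol@my-host.com"],
  // Five inputs that should not match.
  ["", "alice@", "@example.com", "alice@example", "alice@example..com"]
);

console.log(report.missed);         // the uppercase, plus-addressed, and hyphenated addresses
console.log(report.falsePositives); // the address with consecutive dots
```

Ten inputs are enough here to surface three false negatives and one false positive that are easy to miss when only reading the pattern.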

5. Stale framework versions

The model targets the version of the framework that was dominant when its training corpus was assembled. If you are on a newer version with breaking changes, the output looks correct and is quietly wrong.

A real shape: React code that calls ReactDOM.render in a React 18+ codebase that expects createRoot. The model has seen ReactDOM.render in millions of examples and createRoot in far fewer. On React 18 the legacy call still runs, but with a deprecation warning and React 17 behaviour; React 19 removes it. Or Next.js code mixing getServerSideProps, which belongs to the Pages Router, with App Router conventions. The syntax compiles in many cases; the runtime semantics are wrong.

Detection move: name the framework version in every prompt that touches framework APIs. "Next.js 14 with the App Router, React 18" is a different instruction from "React". When reviewing, run the code, not just the type-check, because version mismatches often type-check fine.
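
A review script can also flag the most common stale calls before anyone runs the code. The sketch below is a text scan under loud assumptions: findStaleReactCalls and majorVersion are hypothetical helpers, the version string is assumed to come from package.json, and only the two legacy root APIs that React 18 deprecated are covered.

```javascript
// Hypothetical review helper: legacy root APIs deprecated in React 18,
// with the call that replaces each one.
const LEGACY_ROOT_CALLS = [
  { pattern: /\bReactDOM\.render\s*\(/, replacement: "createRoot(...).render(...)" },
  { pattern: /\bReactDOM\.hydrate\s*\(/, replacement: "hydrateRoot(...)" },
];

function majorVersion(versionRange) {
  // Accepts "18.2.0" as well as ranges such as "^18.2.0"; takes the first number.
  const match = /\d+/.exec(versionRange);
  return match ? Number(match[0]) : NaN;
}

// Returns the suggested replacements for any legacy calls found in the source,
// or an empty list when the project is on React 17 or older.
function findStaleReactCalls(source, reactVersion) {
  if (!(majorVersion(reactVersion) >= 18)) return [];
  return LEGACY_ROOT_CALLS
    .filter(({ pattern }) => pattern.test(source))
    .map(({ replacement }) => replacement);
}

const generated = `ReactDOM.render(<App />, document.getElementById("root"));`;
console.log(findStaleReactCalls(generated, "^18.2.0")); // [ 'createRoot(...).render(...)' ]
console.log(findStaleReactCalls(generated, "17.0.2"));  // []
```

A scan like this only catches names someone thought to list, and a range such as "^18.2.0" is not the resolved version, so it complements running the code rather than replacing it.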

What makes these failures expensive

All five share a property that makes them dangerous: the code compiles, the code runs, and many of the tests pass. The failure surfaces only under conditions the model did not consider — which, by definition, were not in the tests the model wrote. That is what separates these from ordinary bugs: ordinary bugs announce themselves with stack traces. These announce themselves with incidents.

A separate cost is the confidence cost. A human developer who writes a function they are unsure about will often flag it in the commit message or in code review. The model rarely flags uncertainty unprompted. Every output reads with the same flat confidence, so the reviewer gets no signal about where the model was guessing and where to spend extra attention.

The review discipline

The single most useful habit is treating AI-generated code like a pull request from a junior developer, not like a draft you wrote. Read every line. Question every method call you have not seen used in this codebase before. Run the code yourself with at least one input the model did not consider. If the diff is too large to do this with full attention, the diff is too large.

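A practical checklist for any AI-generated diff, drawn from the five detection moves above: verify every unfamiliar method call against the library's actual exports; check security code against the canonical documentation for the primitive; get tests from a separate prompt or write them yourself; run every regex against inputs it should and should not match; confirm the code targets the framework version you actually run, then run it.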

The habit that compounds

The developers who use AI assistants well are not the ones with the cleverest prompts. They are the ones who never trust the first output — only the third or fourth, after they have read it, run it, and watched it produce the expected result on inputs they chose themselves. The five failure modes above do not disappear with practice. They become recognisable on sight, which is the only durable defence against them.

Related reading

Detecting failure modes is the back-half of the AI workflow. The front-half — shaping the prompt so fewer failure modes appear in the first place — is covered in prompting patterns that produce reviewable code. The security category gets its own deep-dive in catching plausible-but-wrong security advice from LLMs. Both posts sit inside the ai-assisted-development topic.

About the writers

Author

Kishan Vaghani

Founder & Lead Engineer, ShareCode

Founder of ShareCode. Writes the engineering deep-dives on this site — WebRTC, Firebase Auth, real-time sync, and the production patterns behind the editor itself.

Real-time collaboration & CRDTs · WebRTC & low-latency media · Firebase authentication & security rules · Next.js & full-stack JavaScript
Reviewed by

Kajal Pansuriya

Developer Educator, ShareCode

Developer educator at ShareCode. Writes the tutorial track — Python, JavaScript debugging, coding-interview prep, and the everyday code-quality habits that hold up in real codebases.

Python fundamentals & teaching beginners · JavaScript debugging & DevTools · Coding-interview preparation · Clean code & code review
