Is this just a Copilot problem or do all LLMs have these issues?

All of them. The failure modes are properties of how language models are trained — predicting plausible next tokens against a huge mixed corpus — not properties of any one product. The frequency varies model to model, but the categories are the same across Copilot, ChatGPT, Claude, Gemini, and open-weights models.

How do I review AI-generated code at a higher tempo without losing the rigour?

Constrain the unit of work. A 30-line diff can be read with full attention in under a minute. A 300-line diff cannot be reviewed at all without skimming, which is where the failure modes slip through. The discipline is to never accept a unit of work larger than what you can actually read.

What kinds of code are LLMs reliably good at?

Boilerplate against a well-known framework with an explicit signature — React components with declared props, REST handlers with named routes, SQL queries against a schema the model can see. Anything where the surrounding pattern is dense in the training data and the variation between codebases is small. The further you get from that centre, the higher the failure rate.

The Five Ways AI-Generated Code Goes Wrong

The most expensive AI bugs are not syntax errors. The compiler catches those. The expensive ones are the confident, plausible-looking outputs that pass code review, pass the tests the model wrote alongside the code, ship to staging, and finally break in production when a real-world value exposes the assumption the model quietly made. There are five recurring shapes worth naming.

Five shapes of confident-but-wrong

Each of the five shapes below is paired with a concrete example and a detection move, so that it can be recognised on sight.

1. Hallucinated APIs

The model calls a function that does not exist on the library you imported. The function name looks reasonable. The signature looks reasonable. The model's training data contained a similar method on a similar library, and the model generalised.

A real shape: lodash.deepMerge sounds like it should exist next to lodash.merge — and the model will happily generate code that calls it. The actual lodash export is merge (which is already deep) and there is no separate deepMerge. Another common one: lodash.deepFreeze looks plausible alongside Object.freeze but lodash does not export it.

Detection move: for every unfamiliar method call in AI-generated code, search the library's actual exports — the published types file or the package's documentation site — before accepting the diff. The red flag is any function name that you have not seen used before in this codebase or its docs.
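One way to make this check mechanical, as a minimal sketch: load the module and ask the runtime which of the names the diff relies on are actually functions. The helper name missingFunctions is hypothetical, Node's built-in path module stands in for whatever library the diff imports, and deepJoin is a deliberately invented name.

```javascript
// Sketch only: a runtime check complements reading the docs, it does not replace it.
const path = require("node:path");

// Return every name in functionNames that is NOT a function on moduleObject.
function missingFunctions(moduleObject, functionNames) {
  return functionNames.filter(
    (name) => typeof moduleObject[name] !== "function"
  );
}

// "join" and "resolve" exist on node:path; "deepJoin" is made up for this example.
const missing = missingFunctions(path, ["join", "resolve", "deepJoin"]);
console.log(missing);
```

A non-empty result means the diff calls something the installed version of the library does not provide. An empty result only proves the name exists, not that it does what the model assumed.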

2. Plausible-but-wrong security advice

The model will produce encryption code, CSP headers, password hashing setups, and JWT verification flows that read correctly to a casual review and are subtly wrong in ways that matter.

A real shape: AES-GCM encryption where the IV is constructed deterministically from the message identifier "for caching reasons." This destroys the security guarantee of GCM, which depends on the IV never repeating under the same key. The code compiles, the unit tests pass, and the system is compromised the first time two messages share an identifier, or one message is edited and re-encrypted under the same key.
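For contrast, here is a minimal sketch of the pattern the buggy version violates: a fresh random IV for every encryption, stored next to the ciphertext. It assumes Node's built-in crypto module rather than the browser WebCrypto API, and the function names encryptMessage and decryptMessage are hypothetical. Key management is out of scope here.

```javascript
const crypto = require("node:crypto");

// Encrypt with AES-256-GCM. The IV is random per call, never derived
// from the message or its identifier.
function encryptMessage(key, plaintext) {
  const iv = crypto.randomBytes(12); // 96-bit IV, the size GCM is designed around
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([
    cipher.update(plaintext, "utf8"),
    cipher.final(),
  ]);
  const tag = cipher.getAuthTag(); // authentication tag, needed to decrypt
  return { iv, ciphertext, tag };
}

// Decrypt and verify. decipher.final() throws if the tag does not match.
function decryptMessage(key, { iv, ciphertext, tag }) {
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([
    decipher.update(ciphertext),
    decipher.final(),
  ]).toString("utf8");
}

// Usage: the same plaintext encrypted twice gets two different IVs.
const demoKey = crypto.randomBytes(32); // illustration only, not a key-management scheme
const first = encryptMessage(demoKey, "hello");
const second = encryptMessage(demoKey, "hello");
console.log(first.iv.equals(second.iv));
```

The review question for any AI-generated variant is simply where the IV comes from. If it is computed from anything other than a random source or a strictly non-repeating counter, the diff should not be accepted.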

Another: a Content-Security-Policy header whose script-src lists trusted hosts but also includes 'unsafe-inline', with no nonce or hash. Injected inline scripts are then allowed to run, which makes the host list largely cosmetic against XSS. The header looks strict. Against injected inline scripts, it is not.
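A rough way to flag this in review is to parse the header value and check whether inline scripts are effectively allowed. The sketch below is a simplification under stated assumptions: the function name scriptSrcAllowsInline is hypothetical, and it ignores script-src-elem, script-src-attr, 'strict-dynamic', and the case of multiple policies on one response.

```javascript
// Returns true when the policy, as written, lets inline scripts run.
function scriptSrcAllowsInline(headerValue) {
  const directives = new Map();
  for (const part of headerValue.split(";")) {
    const tokens = part.trim().split(/\s+/).filter(Boolean);
    if (tokens.length === 0) continue;
    const name = tokens[0].toLowerCase();
    // When a directive is repeated, only its first occurrence counts.
    if (!directives.has(name)) directives.set(name, tokens.slice(1));
  }

  // script-src falls back to default-src when it is absent.
  const sources = directives.get("script-src") ?? directives.get("default-src");
  if (!sources) return true; // nothing restricts scripts at all

  const lower = sources.map((s) => s.toLowerCase());
  const hasUnsafeInline = lower.includes("'unsafe-inline'");
  // Browsers that support nonces or hashes ignore 'unsafe-inline' when one is present.
  const hasNonceOrHash = lower.some((s) =>
    /^'(nonce-|sha256-|sha384-|sha512-)/.test(s)
  );
  return hasUnsafeInline && !hasNonceOrHash;
}

console.log(
  scriptSrcAllowsInline("script-src 'self' https://cdn.example.com 'unsafe-inline'")
);
```

The MDN page for Content-Security-Policy remains the authority on the details; a check like this only catches the most common shape.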

Detection move: never accept AI-generated security code without reading the canonical documentation for the primitive in question — the WebCrypto MDN page for encryption, the OWASP cheat sheet for the relevant attack class, the RFC for the protocol. The model's confidence in this category is systematically miscalibrated.

3. Tests that pass because the model wrote them around its own bug

You ask the model for a function and its tests in the same completion. The tests pass. The code ships. Production breaks. What happened is that the model wrote tests that exercise the paths the model thought about, with the inputs the model imagined, and skipped the cases the model's implementation handles incorrectly.

A real shape: a date-parsing function that does the wrong thing on Feb 29 of a leap year, paired with a test suite that uses March 15 as its only "weird" date. Both halves of the completion share the same blind spot.

Detection move: ask for tests in a separate prompt that names edge cases explicitly, or write the tests yourself and let the model implement against them (see prompting patterns). Tests that come from the same completion as the code under test are not independent evidence.
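As a sketch of what independently chosen edge cases might look like, assume a hypothetical isValidDate(year, month, day) helper. The cases are written down first, by a person, and any implementation is then run against them. Both implementations and the failingCases helper are invented for this illustration.

```javascript
// Edge cases chosen by the reviewer, not by the model.
const edgeCases = [
  { input: [2024, 2, 29], expected: true },  // leap year
  { input: [2023, 2, 29], expected: false }, // not a leap year
  { input: [1900, 2, 29], expected: false }, // divisible by 100, not by 400
  { input: [2000, 2, 29], expected: true },  // divisible by 400
  { input: [2024, 4, 31], expected: false }, // April has 30 days
  { input: [2024, 12, 31], expected: true }, // last day of the year
];

// Return the cases where fn disagrees with the expected answer.
function failingCases(fn, cases) {
  return cases.filter((c) => fn(...c.input) !== c.expected);
}

// A plausible-looking implementation with a blind spot: "every 4th year is a leap year".
function naiveIsValidDate(year, month, day) {
  const daysInMonth = [31, year % 4 === 0 ? 29 : 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
  return month >= 1 && month <= 12 && day >= 1 && day <= daysInMonth[month - 1];
}

// An implementation that lets the Date object do the calendar arithmetic:
// an impossible date rolls over, so the round trip no longer matches.
function isValidDate(year, month, day) {
  const d = new Date(Date.UTC(year, month - 1, day));
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}

console.log(failingCases(naiveIsValidDate, edgeCases).map((c) => c.input));
console.log(failingCases(isValidDate, edgeCases).length);
```

The naive version fails only the 1900 case, which is exactly the kind of input a same-completion test suite tends to leave out.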

4. Off-by-one errors in regex

Regex is dense — small character differences change the meaning significantly — and the model is very good at producing patterns that match almost the right set of strings. Almost.

Common shapes: + where * was wanted (one-or-more instead of zero-or-more, so empty inputs fail); [a-z] where [a-zA-Z] was wanted (uppercase identifiers are rejected); a missing \b word boundary that lets the pattern match inside larger words; a greedy .* where a lazy .*? was needed to stop at the first match.

// "Almost right" email pattern from a model
const emailRe = /^[a-z0-9._]+@[a-z0-9.]+\.[a-z]+$/;

// Misses: uppercase letters, plus-addressing (user+tag@host),
// hyphens in the local part or the domain (user@my-host.com).
// Production-shaped users break in week one.

Detection move: never accept AI-generated regex without writing five inputs the regex should match and five it should not, and running both sets. Regex is a category where reading the pattern is harder than testing it.
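The five-and-five check can be a few lines of code. The sketch below runs the email pattern from this section against inputs a reviewer might choose; the checkRegex helper and the sample addresses are illustrative assumptions, not a complete definition of a valid email address.

```javascript
// The "almost right" pattern under review.
const emailRe = /^[a-z0-9._]+@[a-z0-9.]+\.[a-z]+$/;

// Report inputs the regex wrongly rejects and inputs it wrongly accepts.
function checkRegex(re, shouldMatch, shouldNotMatch) {
  return {
    falseNegatives: shouldMatch.filter((s) => !re.test(s)),
    falsePositives: shouldNotMatch.filter((s) => re.test(s)),
  };
}

const shouldMatch = [
  "alice@example.com",
  "Alice@Example.com",
  "user+tag@example.com",
  "bob@my-host.io",
  "first.last@mail.example.org",
];

const shouldNotMatch = [
  "",
  "no-at-sign.example.com",
  "two@@example.com",
  "user@",
  "user@..com",
];

const report = checkRegex(emailRe, shouldMatch, shouldNotMatch);
console.log(report);
```

Against these inputs the pattern produces three false negatives (the uppercase, plus-addressed, and hyphenated addresses) and one false positive (user@..com). None of the four is obvious from reading the pattern.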

5. Stale framework versions

The model targets the version of the framework that was dominant when its training corpus was assembled. If you are on a newer version with breaking changes, the output looks correct and is quietly wrong.

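A real shape: React code that calls ReactDOM.render in a React 18+ codebase that expects createRoot. The model has seen ReactDOM.render in millions of examples and createRoot in only a fraction as many. In React 18 the old call still runs, but it logs a deprecation warning and the app behaves as if it were on React 17, without the concurrent features; in React 19 it has been removed. Or Next.js code mixing getServerSideProps, a Pages Router API, with App Router conventions. The syntax compiles in many cases; the runtime semantics are wrong.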

Detection move: name the framework and its version in every prompt that touches framework APIs. "React 18 with the Next.js App Router" is a different instruction from "React". When reviewing, run the code, not just the type-check, because version mismatches often type-check fine.
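Running the code remains the real check. As a cheap first pass, a reviewer could also scan a diff for calls known to be stale for the version in use. The sketch below is a naive text search, not a parser; the STALE_PATTERNS table and the findStaleCalls function are hypothetical names, and the single entry covers only the ReactDOM.render example from this section.

```javascript
// Each entry: a call pattern, the major version from which it is stale,
// and a hint for the reviewer. Extend the table per project.
const STALE_PATTERNS = [
  {
    pattern: /\bReactDOM\.render\s*\(/,
    sinceMajor: 18,
    hint: "ReactDOM.render: use createRoot from react-dom/client",
  },
];

// Return hints for every stale call found in the source text,
// given the React major version the codebase actually runs.
function findStaleCalls(source, reactMajor) {
  return STALE_PATTERNS
    .filter(
      ({ pattern, sinceMajor }) =>
        reactMajor >= sinceMajor && pattern.test(source)
    )
    .map(({ hint }) => hint);
}

const generated = 'ReactDOM.render(<App />, document.getElementById("root"));';
console.log(findStaleCalls(generated, 18));
```

A text search like this will miss aliased imports and will flag matches inside comments, so it narrows where to look rather than replacing a run of the code.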

What makes these failures expensive

All five share a property that makes them dangerous: the code compiles, the code runs, and many of the tests pass. The failure surfaces only under conditions the model did not consider — which, by definition, were not in the tests the model wrote. That is what separates these from ordinary bugs: ordinary bugs announce themselves with stack traces. These announce themselves with incidents.

A separate cost is the confidence cost. A human developer who writes a function they are unsure about will flag it in the commit message or in code review. The model rarely flags uncertainty on its own. Its output reads with the same flat confidence throughout, which means the reviewer has no signal to spend extra attention where the model was guessing.

The review discipline

The single most useful habit is treating AI-generated code like a pull request from a junior developer, not like a draft you wrote. Read every line. Question every method call you have not seen used in this codebase before. Run the code yourself with at least one input the model did not consider. If the diff is too large to do this with full attention, the diff is too large.
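The "too large" rule can be enforced before review starts. The following is a minimal sketch that counts changed lines in a unified diff and compares the count against a limit. The function names and the default of 30 lines are assumptions for illustration; the right limit is whatever a reviewer can actually read with full attention.

```javascript
// Count added and removed lines in a unified diff, skipping the
// "+++" and "---" file headers. Heuristic: a changed line whose own
// content begins with "++" or "--" would be miscounted as a header.
function countChangedLines(unifiedDiff) {
  return unifiedDiff.split("\n").filter(
    (line) =>
      (line.startsWith("+") && !line.startsWith("+++")) ||
      (line.startsWith("-") && !line.startsWith("---"))
  ).length;
}

// True when the diff is small enough to read line by line.
function isReviewable(unifiedDiff, maxChangedLines = 30) {
  return countChangedLines(unifiedDiff) <= maxChangedLines;
}

const sampleDiff = [
  "--- a/app.js",
  "+++ b/app.js",
  "@@ -1,3 +1,3 @@",
  " const a = 1;",
  "-const b = 2;",
  "+const b = 3;",
  " const c = 4;",
].join("\n");

console.log(countChangedLines(sampleDiff), isReviewable(sampleDiff));
```

When the check fails, the useful response is to ask the model for a smaller unit of work, not to skim the large one.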

When we review AI-generated code together in a ShareCode code space, the two that surface most often are the hallucinated method call and the test that was written around the bug — both are far easier to catch when a second reader is reading the same diff live and can ask "does that function actually exist?" before the change is accepted.

A practical checklist for any AI-generated diff:

Every imported function actually exists in the imported library.
Any security primitive matches the canonical documentation.
Tests were written independently of the implementation, or by you.
Any regex has at least five positive and five negative example inputs.
The code targets the framework version this codebase actually runs, with no deprecated or removed APIs.
You have run the code, not just the type checker.

The habit that compounds

The developers who use AI assistants well are not the ones with the cleverest prompts. They are the ones who never trust the first output — only the third or fourth, after they have read it, run it, and watched it produce the expected result on inputs they chose themselves. The five failure modes above do not disappear with practice. They become recognisable on sight, which is the only durable defence against them.

References & Sources

The primary sources, specifications, and documentation behind this article.

We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs
Joseph Spracklen, Raveen Wijewickrama, A H M Nazmus Sakib, Anindya Maiti, Bimal Viswanath, Murtuza Jadliwala · arXiv:2406.10279 · 2024
Measures how often LLMs invent non-existent packages — empirical backing for failure mode 1, hallucinated APIs.
arxiv.org
Lodash documentation — _.merge and the Object methods
Lodash · Lodash project
The canonical export list used to confirm that deepMerge and deepFreeze do not exist; merge is already a deep merge.
lodash.com
Content Security Policy (CSP)
MDN Web Docs · Mozilla
Documents why 'unsafe-inline' in script-src defeats the allowlist — the detection move for the CSP example in failure mode 2.
developer.mozilla.org
SubtleCrypto: encrypt() method
MDN Web Docs · Mozilla
Shows the AES-GCM example using a fresh random IV per encryption — the canonical pattern the reused-IV bug violates.
developer.mozilla.org
createRoot
React · react.dev
The React 18 API that replaced ReactDOM.render — reference for the stale-framework-version failure mode 5.
react.dev

About the writers

Author

Kishan Vaghani

Founder & Lead Engineer, ShareCode

Founder of ShareCode. Writes the engineering deep-dives on this site — WebRTC, Firebase Auth, real-time sync, and the production patterns behind the editor itself.

Real-time collaboration & CRDTs · WebRTC & low-latency media · Firebase authentication & security rules · Next.js & full-stack JavaScript

Kajal Pansuriya

Developer Educator, ShareCode

Developer educator at ShareCode. Writes the tutorial track — Python, JavaScript debugging, coding-interview prep, and the everyday code-quality habits that hold up in real codebases.

Python fundamentals & teaching beginners · JavaScript debugging & DevTools · Coding-interview preparation · Clean code & code review

