Should I trust autonomous coding agents that ship without intermediate review?

For low-stakes, well-bounded tasks — generating fixtures, scaffolding boilerplate, running a refactor across a hundred files — autonomous agents are useful with a strong test suite as a backstop. For anything that touches security, data integrity, or production traffic, the intermediate review is the workflow. Removing it does not save time; it transfers the time from review to incident response.

How do I get value from AI pair-programming when I'm tired or rushed?

Shrink the unit of work, do not expand it. When you are tired, the failure mode is accepting larger diffs without reading them — the model's confident tone is more persuasive than usual. Move to single-function prompts, run the code after each one, and stop earlier than you would on a fresh day. Tired AI pairing produces tired AI bugs.

Does this workflow work for refactors or only greenfield code?

It works better for refactors than for greenfield, in fact. Refactors have a clear specification (the existing behaviour) and a clear acceptance test (the existing test suite). Greenfield work asks the model to invent specs, which is exactly the case where the failure rate is highest. The checkpoint loop applies to both, but refactors are where it pays off the most, because every accepted diff can be checked against behaviour that already exists.
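A minimal sketch of using existing behaviour as the spec for a refactor. Both functions are hypothetical and exist only for illustration: `order_total_cents_legacy` stands in for the code being replaced, and `order_total_cents` for a model-drafted refactor that has already been read line by line. The check pins the new version to the old outputs on inputs the reviewer picked.

```python
# Hypothetical example: neither function comes from a real codebase.

def order_total_cents_legacy(prices_cents, discount_percent):
    """Existing behaviour: the de facto specification for the refactor."""
    total = 0
    for price in prices_cents:
        total = total + price
    discount = total * discount_percent // 100
    return total - discount


def order_total_cents(prices_cents, discount_percent):
    """Refactored version, e.g. drafted by the model and then reviewed."""
    subtotal = sum(prices_cents)
    return subtotal - subtotal * discount_percent // 100


def find_mismatches(cases):
    """Return the cases where the refactor disagrees with the legacy code."""
    return [
        case for case in cases
        if order_total_cents(*case) != order_total_cents_legacy(*case)
    ]


# Inputs chosen by the reviewer, including the empty and boundary cases.
reviewer_cases = [
    ([], 0),
    ([1000], 0),
    ([1000, 550], 10),
    ([10000], 100),
    ([1999, 1], 25),
]

mismatches = find_mismatches(reviewer_cases)
print("mismatches:", mismatches)
```

An empty list means the refactor agrees with the old code on every case tried. It says nothing about inputs that were not tried, which is why the existing test suite remains the real acceptance test.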

Pair-Programming With an LLM Without Losing the Craft

Pair-programming with an LLM is not the same workflow as pair-programming with a human, and the developers who treat it like a junior pair get better outcomes than the ones who treat it like a senior. The junior framing is not about ability — the model is often capable in surprising directions — it is about where the judgment lives. The judgment stays with you.

The loop that keeps you in control

Treat the model like a fast junior pair: it drafts, you decide. The work cycles, but the judgment never leaves your side — which is exactly why the loop produces code you can stand behind.

The three positions the LLM can play

Human pairing has driver and navigator. AI pairing has three positions, and naming them is the first move toward a workflow you can run repeatably.

Driver — model types, you navigate

The model produces the code, you direct the high-level shape and accept or reject each chunk. This is the position where the velocity gain is highest and the failure rate is highest. The discipline is reading every line the model produces before accepting the next chunk. The most common mistake is letting the model run two or three chunks ahead of your reading.

Navigator — you type, model suggests

You write the code, the model offers completions, alternative phrasings, or warnings about edge cases. This is the lower-risk position and where the model's contribution is closest to the stereotype of a thoughtful pair: naming things you missed, pointing at the boundary case you forgot. Use this position for code that you would not want to delegate but that would benefit from a second perspective.

Reviewer — after the fact

You finish the code, then ask the model to review it. "What could go wrong in this function?" "What edge cases am I missing?" "Read this and tell me which assumption is load-bearing." The reviewer position is the easiest to add to an existing workflow and the one with the most consistent payoff per minute spent.

The checkpoint loop

Whatever position the model is playing, the workflow needs checkpoints — small batches, review before continuing, named decision points. The shape of the loop is the same every time.

1. State the next small unit. One function. One file. One refactor with a named scope. If you cannot say in a sentence what success looks like, the unit is too big.
2. Run the prompt. With the constraints from the prompting patterns post: small scope, named conventions, explicit constraints.
3. Read the output line by line. Not skim. Read. Every method call, every parameter, every assumption.
4. Run the code. Not just the type checker, but the code itself, with at least one input you chose, not one the model suggested.
5. Decide: accept, revise, or discard. The decision goes in the commit message. "Accepted the model's implementation, added the leap-year case it missed" is a real artefact that helps the next review.
6. State the next small unit. Loop.

The discipline of the loop is that every step must complete before the next one starts. The failure mode is collapsing the loop — running three prompts in a row, accepting all three, then reviewing all three at once. By the time you reach the third review, fatigue has set in and the depth of attention drops.
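The "run the code" step can be sketched in a few lines. Everything here is an assumption for illustration: `is_leap_year` stands in for a hypothetical model draft that has already been revised, and the inputs are ones a reviewer might choose rather than ones the model proposed.

```python
# Hypothetical sketch: the function and the inputs are illustrative only.

def is_leap_year(year: int) -> bool:
    """Gregorian leap-year rule, after the reviewer's revision.

    The imagined model draft returned `year % 4 == 0`, which is wrong
    for century years such as 1900.
    """
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Inputs the reviewer chose, with the expected answer for each.
reviewer_inputs = {
    2024: True,   # ordinary leap year
    2023: False,  # ordinary common year
    1900: False,  # century year, not divisible by 400
    2000: True,   # century year, divisible by 400
}

failures = [
    year for year, expected in reviewer_inputs.items()
    if is_leap_year(year) != expected
]

status = "all passed" if not failures else f"failed for {failures}"
print(f"is_leap_year on reviewer-chosen inputs: {status}")
```

The script reports the result and leaves the decision to the reviewer. Whether the outcome is accept, revise, or discard, the inputs that were tried are worth naming in the commit message.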

Where the workflow goes wrong

Three failure shapes account for most of the velocity loss observed when teams adopt AI pair-programming.

Accepting whole files without reading. The model generates a 200-line file, you scan the top, the bottom looks fine, you accept. The middle 150 lines contain assumptions you have not verified. This shows up as production incidents two weeks later that the team cannot trace back to a specific decision.

Asking for too-large units of work. "Build the user dashboard" is not a checkpoint-loop unit. "Add the avatar component to the user dashboard header with these props" is. The size of the unit is the single biggest lever on whether the workflow produces reviewable diffs.

Letting the model invent specs. When the model starts answering questions about what the code should do — "I will add validation on the email field too, assuming you want that" — pause. Decide whether that is actually wanted. The model treats every gap in the spec as permission to fill in the gap with whatever is statistically likely. Some of those guesses are right; some are not what you intended.

The diff discipline

The single sentence that holds the whole workflow together: every accepted change should be a diff you would have signed off in a code review. If you would have asked questions on the PR, ask them now. If you would have requested changes, request them now. If you would have rejected the approach, reject it now. The diff does not care that you generated it; the code will run in production whether the author was a human or a model.

The corollary: if you are accepting AI-generated code at a higher tempo than you would accept human PRs, something is wrong. Either you are reviewing at a depth you would not accept from a colleague, or you are accepting at a quality bar you would not accept from a colleague. Both are workflow bugs worth surfacing.

The version of this we run inside ShareCode is to put two people in one code space — one drives the prompts, the other reviews each output before it is accepted, and they rotate at every checkpoint. Having the reviewer hold the keyboard for the acceptance step is the single change that most reliably stops the loop from collapsing into accept-without-reading.

What stays human

Some parts of the work do not delegate well to the model, and recognising them is half the workflow.

Picking the problem to solve. The model can help execute against a problem statement but cannot tell you which problem is worth solving this week. That decision is structural and stays with the human.
Deciding when to stop. The model will happily keep generating refinements past the point of diminishing returns. Knowing when "good enough" is reached — when the next iteration costs more than it pays back — is a judgment call.
Naming variables that future readers will read. The model can produce passable names but tends toward generic choices. A name like processedUserList is the kind of name that future readers will struggle with; activeUsersForBilling communicates intent. The naming review is a place where human attention pays back disproportionately.
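The naming point is easy to see side by side. This sketch assumes a hypothetical `User` record with `is_active` and `on_paid_plan` fields, and renders the two names above in Python's snake_case. None of it comes from a real codebase.

```python
from dataclasses import dataclass


@dataclass
class User:
    # Hypothetical fields, assumed for illustration.
    name: str
    is_active: bool
    on_paid_plan: bool


users = [
    User("ada", is_active=True, on_paid_plan=True),
    User("bob", is_active=False, on_paid_plan=True),
    User("cy", is_active=True, on_paid_plan=False),
]

# Generic name: says a list was processed, not how or why.
processed_user_list = [u for u in users if u.is_active and u.on_paid_plan]

# Intent-revealing name: the same value, but the reader learns its purpose.
active_users_for_billing = [u for u in users if u.is_active and u.on_paid_plan]

print([u.name for u in active_users_for_billing])
```

Both variables hold the same value. Only the second tells a future reader which filter was applied and what the result is for.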

The habit that compounds

The productivity gains from AI pair-programming tend to evaporate when the review step is skipped. The gains feel large at first because the typing is fast; they shrink as the rework from accepted-but-wrong diffs accumulates. The teams who hold the gains are not the ones with the cleverest tool integrations. They are the ones who never let the checkpoint loop collapse: small batches, real reviews, a diff discipline they would defend to a colleague. That is the habit that compounds.


About the writers

Author

Kajal Pansuriya

Developer Educator, ShareCode

Developer educator at ShareCode. Writes the tutorial track — Python, JavaScript debugging, coding-interview prep, and the everyday code-quality habits that hold up in real codebases.

Python fundamentals & teaching beginners · JavaScript debugging & DevTools · Coding-interview preparation · Clean code & code review

Kishan Vaghani

Founder & Lead Engineer, ShareCode

Founder of ShareCode. Writes the engineering deep-dives on this site — WebRTC, Firebase Auth, real-time sync, and the production patterns behind the editor itself.

Real-time collaboration & CRDTs · WebRTC & low-latency media · Firebase authentication & security rules · Next.js & full-stack JavaScript

