Anthropic shipped Code Review for Claude Code on March 9, 2026. A team of agents runs a deep review on every pull request. They built it for themselves first, then opened it as a research preview for Team and Enterprise customers.

The announcement from Anthropic put it plainly:

Code output per Anthropic engineer is up 200% this year and reviews were the bottleneck. Personally, I’ve been using it for a few weeks and have found it catches many real bugs that I would not have noticed otherwise.

Not everyone is convinced. A common question from software developers runs like this:

I don’t get it. First we let it create code, then we let it do code review. Why not just have the code created correctly the first time? It feels like I have to spend my tokens twice for something that should be right from the start.

Fair question. Boris Cherny, the creator of Claude Code, responded:

Roughly, the more tokens you throw at a coding problem, the better the result is. We call this test time compute. One way to make the result even better is to use separate context windows. This is what makes subagents work, and also why one agent can cause bugs and another (using the same exact model!) can find them. In a way, it’s similar to engineers — if I cause a bug, my coworker reviewing the code might find it more reliably than I can. In the limit, agents will probably write perfect bug-free code. Until we get there, multiple uncorrelated context windows tends to be a good approach.

The review bottleneck nobody planned for

Here’s the thing nobody predicted when AI coding tools started taking off: the bottleneck would move, not disappear.

AI tools made code generation fast. Really fast. Engineers at companies using AI assistants are pushing 2-3x more code than a year ago. But someone still has to look at all that code before it ships. And that someone is usually a senior engineer who was already busy.

The numbers tell the story. Since AI coding tools became widespread, PR review times have spiked by 91%. Senior engineers are spending 19% more of their time on review. One enterprise reported a 28% increase in code output, but 30-40% of it was AI-generated and needed careful human checking. Teams using AI heavily are merging 98% more pull requests while review capacity stays flat.

The traditional deal in software teams used to work like this: junior developers write code, senior developers review it and teach them along the way. Now junior developers prompt AI to write code, and senior developers spend their days quality-checking machine output. The mentorship part has quietly disappeared. Entry-level developer job postings dropped 60% between 2022 and 2024. That pipeline of juniors becoming seniors is getting thinner, and nobody has a good plan for what happens in five years when there aren’t enough experienced engineers to review anything, human-written or otherwise.

Anthropic’s own engineers felt this directly. Before Code Review, only 16% of their PRs received substantive review comments. The rest got skimmed and approved. That’s not a failure of discipline. It’s what happens when output doubles and review capacity doesn’t.

How Claude Code Review actually works

When a PR opens, Code Review dispatches a team of agents. Not one agent. A team. They work in parallel: hunting for bugs, verifying each other’s findings to filter false positives, and ranking issues by severity. The output lands on the PR as a summary comment plus inline annotations on specific lines.

The system scales with the change. A thousand-line refactor gets a deep read from multiple agents. A five-line typo fix gets a lightweight pass. Reviews take about 20 minutes on average.
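The pattern described above — parallel passes, cross-verification to filter false positives, severity ranking — can be sketched in a few lines. Every name and heuristic below is hypothetical; this is not Anthropic's implementation, and a real "agent" would be a model call with its own fresh context window rather than a stubbed pattern check.

```python
import concurrent.futures

def reviewer(agent: str, diff: str) -> list[dict]:
    # Stand-in for one agent's review pass, stubbed with trivial checks.
    findings = []
    if "TODO" in diff:
        findings.append({"by": agent, "issue": "unfinished code", "severity": 2})
    if "except:" in diff:
        findings.append({"by": agent, "issue": "bare except swallows errors", "severity": 3})
    return findings

def review(diff: str, agents: list[str]) -> list[dict]:
    # Run every reviewer in parallel over the same diff.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        per_agent = list(pool.map(lambda a: reviewer(a, diff), agents))
    flat = [f for fs in per_agent for f in fs]
    # Cross-verification: keep only issues at least two agents raised.
    confirmed = {f["issue"]: f for f in flat
                 if sum(g["issue"] == f["issue"] for g in flat) >= 2}
    # Rank by severity, highest first, for the summary comment.
    return sorted(confirmed.values(), key=lambda f: -f["severity"])
```

The interesting design choice is the verification step: a finding only survives if independent passes converge on it, which is what keeps the false-positive rate low enough to be usable.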

The cost structure reflects the depth: $15-25 per review, billed on token usage, scaling with PR complexity. That’s not cheap. Anthropic positions it as a deeper alternative to their open-source Claude Code GitHub Action, which handles lighter-weight checks.

Admins can control spend with repository-level toggles, monthly caps, and an analytics dashboard showing review costs and acceptance rates.

What it actually catches

The numbers from Anthropic’s internal use are striking. On large PRs (over 1,000 lines changed), 84% get findings, averaging 7.5 issues per PR. On small PRs under 50 lines, 31% get findings, averaging 0.5 issues. Engineers disagreed with less than 1% of surfaced findings.

Two specific cases stand out.

A one-line change to a production service looked routine. The kind of diff that normally gets a quick approval because it seems too small to contain anything interesting. Code Review flagged it as critical. The change would have broken authentication for the entire service. The engineer who submitted it said afterward they wouldn’t have caught it on their own.

In TrueNAS’s open-source middleware, Code Review caught a pre-existing bug during a ZFS encryption refactor. A type mismatch was silently wiping the encryption key cache on every sync. It was lurking in adjacent code that the PR happened to touch. A human scanning the changeset wouldn’t have gone looking for it.
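That bug is an instance of a whole class: a value crosses a layer boundary with the wrong type, a comparison silently fails, and state gets destroyed as a side effect. A minimal illustrative sketch of the class in Python — not TrueNAS's actual code, which is structured differently:

```python
# Illustrative sketch of the bug class only; NOT TrueNAS's actual code.
# A cache keyed by dataset name (str), where the sync path receives the
# name as bytes from a lower layer. The membership test then always
# fails, and the "stale" entry is purged on every sync.

cache: dict[str, bytes] = {}

def remember_key(dataset: str, key: bytes) -> None:
    cache[dataset] = key

def on_sync(dataset: bytes) -> None:
    # Type mismatch: b"tank/secure" never equals "tank/secure", so this
    # branch runs every time and silently wipes the real entry.
    if dataset not in cache:
        cache.pop(dataset.decode(), None)

remember_key("tank/secure", b"\x00" * 32)
on_sync(b"tank/secure")
assert "tank/secure" not in cache  # key cache wiped by a routine sync
```

Nothing crashes and nothing logs, which is exactly why a human skimming the diff would walk past it.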

These aren’t hypothetical examples. They’re the kind of bugs that ship to production on a Friday afternoon because the reviewer had twelve other PRs in their queue and this one looked safe.

Why separate context windows matter

Boris Cherny’s response to the skeptical user is worth unpacking further, because it gets at something counterintuitive about how these models work.

When an agent writes code, it’s operating inside a context window that holds its reasoning chain, the decisions it made, the compromises it accepted. Bugs hide in the assumptions baked into that chain. The writing agent can’t easily see them because it built the logic around them.

A separate agent with a fresh context window has no loyalty to those assumptions. It reads the code cold. It asks “does this actually do what it claims to do?” without carrying the baggage of “well, I wrote it this way because of X.”

This mirrors how human review works. I’ve written code that I was sure was correct, read through it three times, and then a colleague caught a bug in thirty seconds. Not because they’re smarter, but because they didn’t share my blind spots.

The “test time compute” idea takes this further. Broadly, throwing more compute at inference time (more tokens, more passes, more agents) tends to produce better results. Separate context windows are one of the most effective ways to spend that compute. It’s the same principle as proofreading: five people reading a document once catch more errors than one person reading it five times.
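The uncorrelated-reviewers intuition can be put in numbers with a toy model. The catch rates here are illustrative placeholders, not measured figures: if each independent reviewer catches a given bug with probability p, and their errors are uncorrelated, the chance the bug slips past all n of them is (1 − p)^n.

```python
# Toy model of uncorrelated reviewers (illustrative rates, not data):
# each reviewer independently catches a bug with probability p, so the
# probability the bug escapes all n reviewers is (1 - p) ** n.
def miss_probability(p: float, n: int) -> float:
    return (1 - p) ** n

# One reviewer with a 60% catch rate misses 40% of bugs...
print(miss_probability(0.6, 1))            # 0.4
# ...while five uncorrelated reviewers miss only about 1%.
print(round(miss_probability(0.6, 5), 4))  # 0.0102
```

The whole argument hinges on "uncorrelated": five passes that share the writer's context share its blind spots, which is why fresh context windows, not just more passes, are the lever.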

The real question: is this actually useful, or is it AI fixing AI?

The cynical reading is obvious: AI writes buggy code, so now we need more AI to check the buggy code. That feels circular.

But this critique proves too much. Humans write buggy code too. We’ve always had code review. We’ve always had QA teams. We’ve always had staging environments and canary deployments and all the other mechanisms for catching mistakes before they reach users. Adding AI review to AI-generated code isn’t philosophically different from adding human review to human-generated code.

The real question is whether the cost math works. A Code Review that costs $20 and catches a bug that would have caused a production incident worth days of engineering time pays for itself fast. At less than 1% false positive rate, the signal-to-noise ratio is high enough that developers aren’t drowning in irrelevant comments.
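That cost math is easy to write down. The review price comes from the article; the incident cost and the catch rate are assumed placeholders, so treat this as a template to plug your own numbers into, not a claim about actual rates:

```python
# Back-of-envelope ROI for an automated review (assumed inputs marked).
review_cost = 20.0          # midpoint of the $15-25 per-review range
incident_cost = 5000.0      # ASSUMED: engineering cost of one production incident
p_prevents_incident = 0.01  # ASSUMED: 1 in 100 reviews catches an incident-grade bug

expected_savings = p_prevents_incident * incident_cost
print(expected_savings)               # 50.0
print(expected_savings > review_cost) # True: $50 expected value vs $20 spent
```

Under these assumptions the review pays for itself; the break-even point moves with how often your team actually ships incident-grade bugs.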

There are real limitations, though. Independent audits have found that automated review still misses architectural flaws, sometimes flags race conditions that can’t actually occur, and struggles with domain-specific context. One audit found Claude flagged cosmetic issues while completely missing that an authentication system should have used JWKS standards. It’s good at catching mechanical bugs. It’s still weak at asking “should we be building this at all?”

That’s fine. Human reviewers should be focusing on the architectural and design questions anyway. Offloading the bug-hunting frees them up to do exactly that.

The competitive landscape

Claude Code Review isn’t arriving in a vacuum. GitHub Copilot has its own code review feature. CodeRabbit, Qodo, and a growing list of startups are all chasing this space. The common thread: everyone sees the review bottleneck, and everyone is building agents to address it.

What sets Anthropic’s approach apart is the multi-agent architecture. Most competitors run a single model pass over the diff. Claude Code Review dispatches a team that cross-checks findings. The tradeoff is cost and latency (20 minutes and $15-25 per review) versus the lighter-weight alternatives that give faster, cheaper, but shallower feedback.

Five predictions for where this goes

1. Review will split into two tiers. Mechanical review (bug detection, security patterns, type safety, error handling) will become fully automated within a year. Architectural review (does this design make sense, does this belong in this service, are we building the right thing) will remain human for much longer. The tools that acknowledge this split honestly, rather than claiming to do both, will win.

2. The “tokens twice” problem will merge. Right now, code generation and code review happen as separate steps: an agent writes, then a different agent reviews. Within 18 months, this will collapse into a single pipeline where the writing process includes multiple internal verification passes. You won’t think of it as “review” anymore, the same way you don’t think of a compiler’s type checker as “reviewing” your code.

3. The cost will drop by 10x. A $20 review will become a $2 review. Smaller, faster models will handle the first pass. Bigger models will be reserved for the complex cases. This will matter because right now, the pricing limits adoption to large engineering organizations. Cheaper reviews open this up to solo developers and small teams.

4. Review data will become training data. Every finding that an engineer marks as correct or incorrect is a signal. Companies running thousands of reviews per month are generating the dataset for the next generation of review models. Anthropic running this on their own PRs first isn’t just dogfooding. It’s building a feedback loop.

5. The mentorship gap will get worse before it gets better. As review gets automated, the last remaining touchpoint where senior engineers teach junior engineers is disappearing. Nobody has a good solution for this yet. The teams that figure out how to preserve knowledge transfer in an AI-heavy workflow will have a real competitive advantage in hiring and retention over the next five years.

Where this leaves us

Claude Code Review is a pragmatic solution to a real problem. Code output has outpaced code review capacity, and bugs are slipping through the cracks. Throwing a team of agents at every PR is an expensive but effective response.

The users who asked “why not just have the code created correctly the first time?” are asking the right question at the wrong time. Someday, agents probably will write near-perfect code. Boris Cherny said as much: “In the limit, agents will probably write perfect bug-free code.” But we’re not at the limit yet. We’re in the messy middle where AI generates more code than humans can review and the best available answer is more AI, pointed in a different direction, checking the work.

The interesting thing about this moment is how familiar it feels. Twenty years ago, we added automated testing alongside manual testing. Ten years ago, we added CI/CD alongside deployment scripts. Now we’re adding AI review alongside human review. Each time, the pattern is the same: automate the parts humans are bad at (catching mechanical errors at scale), free humans to do what they’re good at (judgment calls about design and intent), and accept that the transition period will be awkward and expensive.

We’re in the awkward and expensive part right now. But the authentication bug that got caught before it shipped? That’s a pretty good argument for staying the course.

“Same model, different context window, different blind spots. That’s not a paradox. That’s how review has always worked.” - Rushi