Three months ago I wrote about a feature-workflow plugin that added structure to how I work with Claude Code. Capture ideas, plan them, implement them, ship them. The structure helped — I stopped losing track of what I was building mid-session and started actually finishing things.

But the review step was weak. The plugin had its own internal review skill, which meant Claude was reviewing its own code, and it was too agreeable. The same blind spots persisted across features, and a feature often took several iterations of testing before it was functional.

The plugin is now at v7, and the biggest change is that internal review is gone entirely. It's been replaced with external reviewers: different AI models running in separate terminals, reading the implementation through a different client and writing independent critiques. The process is slower, but I believe it has made me faster overall and my code more reliable.

What Changed in the Plugin

The original plugin tracked features in a JSON backlog file. That worked, but it was more structure than necessary. The current version uses file presence as the state machine:

docs/features/dark-mode/
├── idea.md       # exists = backlog
├── plan.md       # exists = in-progress
└── shipped.md    # exists = completed

A feature with only idea.md is in the backlog. Add plan.md and it's in progress. Add shipped.md and it's done. No status fields to update, no JSON to parse. The PostToolUse hook watches for these files and regenerates a DASHBOARD.md automatically.
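As a sketch, the status derivation is simple enough to express as a few file checks, assuming the docs/features/&lt;id&gt;/ layout above (the hook's actual logic may differ):

```shell
# Minimal sketch of deriving status from file presence, assuming the
# docs/features/<id>/ layout shown above. The plugin's hook may differ.
feature_status() {
  local dir="docs/features/$1"
  if   [ -f "$dir/shipped.md" ]; then echo "completed"
  elif [ -f "$dir/plan.md" ];    then echo "in-progress"
  elif [ -f "$dir/idea.md" ];    then echo "backlog"
  else echo "unknown"
  fi
}
```

Because status is derived rather than stored, the dashboard can never disagree with the files.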

The commands stayed mostly the same — /feature-capture, /feature-plan, /feature-ship. The addition is /feature-submit, which handles the handoff to external reviewers.

The Adversarial Review Process

I use two reviewer options: a Gemini CLI skill and a Codex plugin. I typically run one reviewer per project, not both; having two options matters because different environments have different tools available. The core principle is that a model reviewing another model's implementation catches things the implementing model doesn't, because it isn't being agreeable about its own work.

Both reviewers operate under strict constraints. They're read-only — no modifying source code. They have explicit no-code enforcement so they can't drift from reviewing into implementing (even though they want to). And they post their feedback directly on the GitHub PR — structured reviews with a verdict (pass, conditional-pass, fail) plus inline comments on specific lines of code.

The gemini-reviewer defines itself as a Senior Software Architect and Security Engineer:

## Mandates

1.  **READ-ONLY:** You MUST NOT modify any source code. Your only
    permitted actions are reading code and posting PR reviews.
2.  **NO-CODE ENFORCEMENT:** You are a Reviewer, not an Implementer.
    You must NEVER start implementing fixes — only document what
    needs to change.
3.  **CONSTRUCTIVE CRITIQUE:** Every finding must be actionable.
    Explain why it is a risk and how it should be addressed.
4.  **PR-BASED OUTPUT:** Post all feedback as GitHub PR reviews
    and comments using `gh` CLI.

The codex-reviewer has three separate review skills for different phases — plan review, implementation review, and shipped review. The implementation review focuses on correctness bugs, plan drift, missing tests, and accidental scope expansion.

Reviews land on the PR as structured comments. For specific code issues, reviewers post inline comments on the relevant lines:

gh pr review <pr-number> --comment --body "## Gemini Review

### Verdict: CONDITIONAL PASS

### Critical Findings
- combat_start event overwrites blank state before listener
  registers — race condition on fast connections

### Recommendations
- Add integration test for sub-100ms combat start

### Areas of Concern Response
- Sidebar regression confirmed fixed — verified via Playwright"
# Inline comment on a specific line
gh api repos/{owner}/{repo}/pulls/<pr-number>/comments \
  --method POST \
  --field body="Race condition: this listener registers after combat_start fires" \
  --field commit_id="$(gh pr view <pr-number> --json headRefOid --jq '.headRefOid')" \
  --field path="src/combat/blank-state.ts" \
  --field line=42

The Round Cycle

A typical feature goes through 3-4 review rounds. The cycle works like this:

When implementation is ready, I run /feature-submit <id>. The skill creates a feature branch off dev, commits the work, and opens a draft PR. It auto-generates a PR description from the git diff and plan progress, then asks me to edit it — particularly the "Areas of Concern" section where I flag what I'm uncertain about. The PR description becomes the review context.
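The generated description is just markdown with a couple of required sections. A rough sketch of the scaffold ("Areas of Concern" is the real section name; the rest is assumed):

```shell
# Hypothetical sketch of the PR description scaffold /feature-submit asks
# me to edit. "Areas of Concern" is the real section; the rest is assumed.
scaffold_pr_body() {
  cat > "$1" <<'EOF'
## Summary
<!-- auto-generated from the git diff and plan progress -->

## Areas of Concern
<!-- things I'm uncertain about; reviewers respond to these directly -->
EOF
}
```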

flowchart TD
  subgraph Plan
      A[Capture] --> B[Plan]
  end

  B --> C[Implement]

  subgraph Review
      C --> D[Submit PR]
      D --> E{Reviewer<br/>Verdict}
      E -->|pass| F[Ship]
      E -->|fail| G[Respond]
      G --> D
  end

  F --> H[Merge to dev]

In a separate terminal, I activate the reviewer skill. It reads the PR description and diff using gh, analyzes the implementation against the plan, and posts its review directly on the PR: both a top-level verdict and inline comments on specific lines.

When I'm ready to address the feedback, I run /feature-submit <id> --respond. The skill reads all PR reviews and comments via the GitHub API, consolidates findings by severity, highlights anything flagged by the reviewer, and asks me which findings to address, defer, or disagree with. After implementing fixes, it commits, pushes, and adds a comment on the PR summarizing what changed.
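The consolidation step works because verdicts follow a predictable format. As an illustration, a hypothetical helper that pulls the verdict out of a review body saved to a file (the real skill fetches reviews through the GitHub API):

```shell
# Hypothetical helper: extract the verdict line from a review body saved
# to a file. The real skill reads reviews via the GitHub API instead.
extract_verdict() {
  grep -m1 -oE 'Verdict: (CONDITIONAL PASS|PASS|FAIL)' "$1" | cut -d' ' -f2-
}
```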

The PR serves as the single record of the entire review conversation — each round is visible as commits and comments, not scattered across markdown files in the repo.

On one project, a feature went through a FAIL on round 1 — the reviewer caught a functional regression where a sidebar was hidden on mobile with no alternative navigation. Round 2 came back PASS after the fix, which included 25 Playwright verification screenshots as evidence. Another feature took 4 rounds — a race condition in event handling that the reviewer caught in round 1 went unaddressed in round 2's implementation, so the reviewer flagged it again. Fixed in round 3, shipped in round 4.

When reviews are satisfactory, /feature-ship writes shipped.md to the feature branch, merges the PR into dev, and cleans up the branch — both local and remote.

The Backlog as a Decomposition Tool

Capturing ideas isn't just task tracking. It's how I manage complexity in real time.

When I'm implementing a feature and the work starts getting tangled — too many concerns, unclear boundaries — I split off a smaller component and add it to the backlog with /feature-capture. That idea gets its own idea.md with the problem statement and context while it's fresh. Then I go back to the current feature with a cleaner scope.

The backlog grows and shrinks over the life of a project. It starts small, expands as I discover the real scope of the work, then contracts as features ship; the shape looks like a bell curve over time. On a current project I have 115 features tracked: 72 shipped, 15 in the backlog, a few in progress. With the added categories component, it's possible to track coding items and business items separately. I'll often have one terminal open working on coding tasks and another working on business-related tasks.
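Since status lives in file presence, these counts fall out of a directory walk. A sketch under the same assumed docs/features/&lt;id&gt;/ layout:

```shell
# Sketch: tally features by status from file presence alone, assuming
# the docs/features/<id>/ layout. The plugin's dashboard hook may differ.
count_features() {
  local shipped=0 in_progress=0 backlog=0
  for dir in docs/features/*/; do
    if   [ -f "$dir/shipped.md" ]; then shipped=$((shipped + 1))
    elif [ -f "$dir/plan.md" ];    then in_progress=$((in_progress + 1))
    elif [ -f "$dir/idea.md" ];    then backlog=$((backlog + 1))
    fi
  done
  echo "shipped=$shipped in-progress=$in_progress backlog=$backlog"
}
```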

Scope creep prevention is a big part of why this works. When the scope-guard skill flags something as outside the current feature's boundaries, I have a choice: include it and expand scope, or add it to the backlog and stay focused. Most of the time I add it to the backlog. Some backlog items are legitimate decompositions of complex work; others are spur-of-the-moment ideas, like a new feature, an optimization, or a UI tweak. Both get captured the same way.

One Thing at a Time (Mostly)

I use separate branches per feature, not worktrees. I tend to work on one item at a time unless the two things are genuinely different — frontend work vs. backend work, business tasks vs. coding tasks. Similar work in parallel creates interference because the mental models overlap and decisions bleed between features. Dissimilar work doesn't have that problem.

The git history and feature records make context-switching possible without losing state. Each feature has its idea.md, plan.md, review history, and branch. When I come back to something after a few days, the context is reconstructable. That's the whole point of the ceremony — not to slow things down, but to create a trail I can follow back.

Slower but Faster

The review cycle adds time. Three to four rounds with an external reviewer, structured findings to address or explicitly defer, PR descriptions to write before each submission. It feels slower than just shipping.

But I've spent less time troubleshooting problems downstream. Issues that would have surfaced in testing or production (race conditions, missing edge cases, regressions) get caught during review. The CI/CD and git processes help too: keeping a record of every review round, every deferred finding, and every decision means I can always reconstruct where I am with a project and why a particular choice was made. Creating these guardrails for myself and my coding agents has resulted in more and better shipped code.

This workflow works for me because it matches how I think about decomposition and focus. Small pieces, one at a time, with an external check before shipping. Someone who works differently — who thrives on rapid iteration without structure, or who prefers to hold the full picture in their head — might build a different harness. The point isn't to use this system. It's to find what makes you more effective and build structure around it.


Source: claude-code-plugins, gemini-reviewer, codex-reviewer