On working with agentic models, part III: Multi-model planning and code reviews

Tags: Coding, GenAI
Published: March 1, 2026
Description:

Automating multi-agent, multi-model planning and code reviews with AI skills and Git worktrees. A practical approach to agentic software development workflows.

Context

Continuing the series on learning and improving AI workflows. This post explores how Git worktrees and agent skills augment and automate software development workflows discussed in earlier entries in this series. I share notes on automating multi-model iteration during planning, extending that idea into code implementation, and using Git worktrees to keep code changes organized.

Automating multi-model plan reviews

In On working with agentic models > Improve specs by having models critique each other, I describe a process of having Claude produce plans and Codex review them. This process has caused a significant jump in the quality of the output I get because Codex is a shrewd reviewer that’s good at calling out under-specification.

Previously, I'd been handling this exchange manually: prompting Codex to read the updated plan from disk, then copying its feedback back into Claude. After seeing several comments from developers about successfully automating this step (e.g., [1], [2]), I decided to ask Claude to write a skill that lets Claude Code exchange feedback with Codex.

The skill: Iterative plan review

The skill definition can be found at: https://github.com/kareemf/claude-skills/blob/main/iterative-plan-review/SKILL.md. At a high level, the skill implements the following loop:
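Roughly, the loop alternates between a reviewer critiquing the plan and an implementer revising it until the reviewer approves or an iteration cap is hit. Here is a minimal sketch of that control flow; the model calls are stubbed as hypothetical shell functions (the real skill shells out to the `codex` and `claude` CLIs instead), and the path and feedback strings are illustrative:

```shell
PLAN="specs/example-plan.md"   # plan under review (illustrative path)
MAX_ITERATIONS=3

# Stand-ins for the model calls -- the skill itself would run something like
# `codex exec "Review the plan in $PLAN"` and feed the critique back to Claude.
reviewer()    { [ "$1" -ge 2 ] && echo "APPROVED" || echo "Tighten the rollout section"; }
implementer() { echo "revised $PLAN per: $1"; }

i=1
while [ "$i" -le "$MAX_ITERATIONS" ]; do
  feedback=$(reviewer "$i")                 # 1. reviewer critiques the current plan
  case "$feedback" in
    *APPROVED*) echo "Plan approved after $i iteration(s)"; break ;;  # 2. stop on approval
  esac
  implementer "$feedback" >/dev/null        # 3. implementer revises the plan on disk
  i=$((i + 1))
done
```

The real skill replaces the stubs with actual model invocations and surfaces the reviewer's notes at each round, along with the final iteration count.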

Invocation

Invoke the skill in Claude Code with the slash command /iterative-plan-review. For Codex, the prefix is $ instead of / ($iterative-plan-review). For instance, you can use the slash command to point the skill at an existing file:

/iterative-plan-review Review and refine the plan in specs/open-sourcing.md

Or reference the skill name in context:

Review specs/open-sourcing.md using the iterative-plan-review skill

You can also invoke it implicitly via natural language:

I have a plan in specs/open-sourcing.md - can you have Codex review it and iterate until it's solid?

Lastly, if you’ve just finished developing a plan in the current context window, you can invoke the skill with the base command alone and it will pick up the spec implicitly:

/iterative-plan-review

Controlling arguments

Arguments can be set using natural language, for instance:

/iterative-plan-review Review specs/open-sourcing.md with max 3 iterations, use high reasoning effort

In practice

As always, the earlier that incorrect assumptions, missing requirements, or excess complexity are identified, the cheaper they are to correct. Adding a multi-model review of specifications helps spot those issues in the design phase, increasing the relative safety of delegating implementation.

In practice, this extra layer of review helps to surface requirements that I may not think of up front. For example, this process flagged App Store rejection risks and mitigations tied to features in an app I’m working on. As someone working towards app submission for the first time, those risks were certainly blind spots for me.

It’s cool to see a message like this at the end, along with a readout of the number and nature of improvements to the plan:

✅ Plan approved by Codex after 2 iterations. Let me add the final reviewer notes to the plan, then present for your approval.

Alternative approaches

Multi-model vs orchestration

It's a testament to the pace of change that the release of team orchestration for Claude Code called into question whether this post was even still relevant. Ultimately, I decided that this post still has legs because Git worktrees and custom skills are useful tools that are portable across workflows. My approach focuses on inter-model communication, which I imagine complements intra-model orchestration – though only experimentation will confirm that.

Bash exec vs tmux

An alternative approach that I considered, and one that seems popular, is to use tmux to spawn a pane for each agent, letting the agents communicate with each other directly via the CLI.

The main benefit is that you retain direct interactivity and observability over each agent. For instance, you'd be able to attach to running panes and modify the context, switch models, etc.

But with the bash-based approach, what you lose in observability and interaction you gain in simplicity. For instance, you don't have to worry about the failure modes of sending input to tmux panes.
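To make the trade-off concrete, here is a side-by-side sketch of the two approaches, wrapped in functions so neither actually runs here. The tmux commands and `codex exec` are real CLI invocations as far as I know, but the session name and prompt are illustrative assumptions:

```shell
tmux_approach() {
  # Interactive and observable: the agent lives in a pane you can attach to,
  # inspect, or take over mid-run.
  tmux new-session -d -s codex-agent codex
  tmux send-keys -t codex-agent "review specs/open-sourcing.md" Enter
  tmux capture-pane -t codex-agent -p    # scrape the pane for the reply
}

bash_exec_approach() {
  # Simpler: one subprocess call, and the reply arrives on stdout -- no pane
  # plumbing or send-keys timing to get wrong.
  codex exec "review specs/open-sourcing.md"
}
```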

Once I had a working implementation that provided clear value, the next logical step was to extend the multi-agent paradigm from planning to implementation.

Automating multi-model implementation and code reviews

Shifting gears from planning to implementation, this is where Git worktrees become especially useful.
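For readers who haven't used worktrees: a worktree is an additional checkout of the same repository, on its own branch, in its own directory. A minimal sketch of the setup, using a throwaway repo in place of a real project (all names here are illustrative):

```shell
# Throwaway repo standing in for a real project.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.email=a@b.c -c user.name=demo commit -q --allow-empty -m "init"

# One worktree per implementation run: an isolated checkout on a new branch,
# so agents can commit freely without disturbing the main checkout.
wt="$repo-wt"
git -C "$repo" worktree add -q -b feature/open-sourcing "$wt"
wt_branch=$(git -C "$wt" branch --show-current)   # worktree sits on the new branch

# The main checkout stays put; remove the worktree once the branch is done.
git -C "$repo" worktree remove "$wt"
```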

The skill: Iterative implementation

The skill definition can be found at: https://github.com/kareemf/claude-skills/blob/main/iterative-implementation/SKILL.md. Here is the loop that the skill follows:
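In outline, the skill breaks the spec into tasks, then runs an implement/review loop per task until the reviewer approves or the iteration cap is hit. A rough sketch of that shape, with the model calls stubbed as hypothetical shell functions and the task names invented for illustration:

```shell
# 1. Task breakdown proposed by the implementer (illustrative task names).
TASKS="scrub-secrets add-license write-readme"
MAX_REVIEW_ITERATIONS=10
done_tasks=0

implement() { echo "commit: $1"; }   # stand-in for the implementer agent
review()    { echo "LGTM"; }         # stand-in for the reviewer agent

for task in $TASKS; do
  implement "$task" >/dev/null                  # 2. implementer commits the task
  i=1
  while [ "$i" -le "$MAX_REVIEW_ITERATIONS" ]; do
    verdict=$(review "$task")                   # 3. reviewer inspects the change
    [ "$verdict" = "LGTM" ] && break            # 4. approved -> move to next task
    implement "address: $verdict" >/dev/null    # 5. otherwise apply the feedback
    i=$((i + 1))
  done
  done_tasks=$((done_tasks + 1))
done
echo "$done_tasks task(s) ready for human review"
```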

Invocation

The /iterative-implementation slash command can be pointed at a plan/spec file:

/iterative-implementation for @specs/open-sourcing.md with Claude as the implementer and Codex as the reviewer. Set max review iterations to 10. Auto-approve task breakdowns

In practice

If a spec does not have a Markdown-formatted task list, Claude/Opus 4.5 first proposes a task breakdown.


With the Claude/Opus 4.5 implementer, Codex/GPT 5.2 reviewer duo, I found that GPT was good about flagging dead code and stale comments that Opus would leave behind. Opus was more likely to make explicit review commits while GPT was more inclined to amend existing commits to apply feedback. I hadn’t directed the models to take one approach over the other, but I could see it being helpful to bake a preference into the skill for adding review commits rather than amending existing ones. More granular commits could make human review/triage easier, and commits can always be squashed if they prove to be too verbose.

The branching strategy that Codex / GPT 5.2-codex went with was effectively what I had in mind. Claude / Opus 4.5 needed a reminder to use the --no-ff flag – another preference that could be baked into the skill definition.
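For reference, what --no-ff buys: even when a branch could fast-forward, the merge produces a dedicated merge commit with two parents, so the feature branch stays visible in history. A throwaway-repo demo (branch and commit names are illustrative):

```shell
repo=$(mktemp -d)
g() { git -C "$repo" -c user.email=a@b.c -c user.name=demo "$@"; }  # helper

g init -q -b main
g commit -q --allow-empty -m "init"
g checkout -q -b feature/demo
g commit -q --allow-empty -m "feature work"
g checkout -q main

# Without --no-ff this merge would fast-forward and leave no trace of the
# branch; with it, a merge commit records where the branch came back in.
g merge -q --no-ff feature/demo -m "Merge feature/demo"
parent_count=$(( $(g rev-list --parents -n 1 HEAD | wc -w) - 1 ))   # 2 parents
```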


Orchestration: putting it all together

Tying together agentic planning, implementation, and review, here is an overview of what the full process looks like:

  • Planning
    • Human drafts initial spec or problem statement
    • Human + planning agent iterate
    • Planning agent + review agent further iterate. Repeat until…
    • Human approves final plan
  • Implementation
    • Implementation agent writes code/performs task
    • Review agent provides structured feedback
    • Agents iterate until the work is ready for human review or human intervention is needed
    • Human reviews, then either reinitializes the loop with additional context or accepts
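Chained together, the process above might be driven by a script along these lines. This is a sketch, not the actual orchestration: `claude -p` is Claude Code's non-interactive print mode, but the prompts and the idea of invoking the skills this way are my own illustrative assumptions:

```shell
run_pipeline() {
  spec="$1"
  # Planning: multi-model review loop over the human-drafted spec.
  claude -p "/iterative-plan-review Review and refine the plan in $spec" || return 1
  # Human gate: approve the final plan before implementation starts.
  printf 'Approve plan in %s? [y/N] ' "$spec"
  read -r ok && [ "$ok" = "y" ] || return 1
  # Implementation: implement/review loop, then hand back for human review.
  claude -p "/iterative-implementation for @$spec with Claude as the implementer and Codex as the reviewer"
}
# Usage: run_pipeline specs/open-sourcing.md
```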

Differences from guided workflow

In On working with agentic models, part II, I mentioned that compacting or starting new sessions after milestones both improves output quality and reduces token usage. In this more automated process, where models can work through a sequence of tasks for extended periods without human intervention, the manual compacting step largely goes away. I wonder, then, whether an ephemeral reviewer offsets the potential quality drift of a long-running implementer. My anecdotal experience so far suggests that it does.