Automating multi-agent, multi-model planning and code reviews with AI skills and Git worktrees. A practical approach to agentic software development workflows.
- Context
- Automating multi-model plan reviews
- The skill: Iterative plan review
- Invocation
- Controlling arguments
- In practice
- Alternative approaches
- Multi-model vs orchestration
- Bash exec vs tmux
- Automating multi-model implementation and code reviews
- The skill: Iterative implementation
- Invocation
- In practice
- Orchestration: putting it all together
- Differences from guided workflow
Context
Continuing the series on learning and improving AI workflows. This post explores how Git worktrees and agent skills augment and automate the software development workflows discussed in earlier entries. I share notes on automating multi-model iteration during planning, extending that idea into code implementation, and using Git worktrees to keep code changes organized.
Automating multi-model plan reviews
In On working with agentic models > Improve specs by having models critique each other, I describe a process of having Claude produce plans and Codex review them. This process has caused a significant jump in the quality of the output I get because Codex is a shrewd reviewer that’s good at calling out under-specification.
Previously, I'd been handling this exchange manually by prompting Codex to read the updated plan from disk and copying the feedback into Claude. After seeing several comments from developers about successfully automating this step (e.g., [1], [2]), I decided to ask Claude to write a skill that allows Claude Code to exchange feedback with Codex.
The skill: Iterative plan review
The skill definition can be found at: https://github.com/kareemf/claude-skills/blob/main/iterative-plan-review/SKILL.md. At a high level, the skill runs a Claude-Codex review loop: Claude sends the plan to Codex for review, applies the feedback, and repeats until Codex approves or an iteration cap is reached.
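That loop can be sketched roughly as follows. This is a hedged reconstruction, not the skill's actual contract: the non-interactive `codex exec` invocation is real, but the prompt wording, the `APPROVED` marker, and the default iteration cap are my illustrative assumptions.

```shell
# Hypothetical sketch of the plan-review loop; the prompt text, APPROVED
# marker, and iteration cap are illustrative assumptions.
review_plan() {
  local plan="$1" max_iters="${2:-5}" i feedback
  for i in $(seq 1 "$max_iters"); do
    # Ask Codex, non-interactively, to review the plan on disk.
    feedback=$(codex exec "Review the plan in $plan. Reply APPROVED if it needs no changes; otherwise list concrete issues.")
    if printf '%s' "$feedback" | grep -q APPROVED; then
      echo "Plan approved after $i iteration(s)."
      return 0
    fi
    # In the real skill, Claude revises $plan here based on the feedback
    # before the next pass.
  done
  echo "Hit max iterations ($max_iters) without approval." >&2
  return 1
}
```

The key property is that the reviewer is a fresh process each pass: it reads the plan from disk, so the only shared state between the two models is the artifact itself.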
Invocation
Invoke the skill in Claude Code with the slash command /iterative-plan-review. In Codex, the equivalent is $iterative-plan-review ($ instead of /). For instance, you can use the slash command to point the skill at an existing file:
/iterative-plan-review Review and refine the plan in specs/open-sourcing.md

Or reference the skill name in context:

Review specs/open-sourcing.md using the iterative-plan-review skill

Also implicitly via natural language:

I have a plan in specs/open-sourcing.md - can you have Codex review it and iterate until it's solid?

Lastly, if you’ve just finished developing a plan in a context window, you can invoke the skill without any arguments, and it will implicitly pick up the spec with just the base command:

/iterative-plan-review

Controlling arguments
Arguments can be set using natural language, for instance:
/iterative-plan-review Review specs/open-sourcing.md with max 3 iterations, use high reasoning effort

In practice
As always, the earlier incorrect assumptions, missing requirements, or excess complexity are identified, the cheaper they are to correct. Adding a multi-model review of specifications helps to spot those issues in the design phase, increasing the relative safety of delegating implementation.
In practice, this extra layer of review helps to surface requirements that I may not think of up front. For example, this process flagged App Store rejection risks and mitigations tied to features in an app I’m working on. As someone working towards app submission for the first time, those risks were certainly blind spots for me.
It’s cool to see a message like this at the end, along with a readout of the number and nature of improvements to the plan:
✅ Plan approved by Codex after 2 iterations. Let me add the final reviewer notes to the plan, then present for your approval.

Alternative approaches
Multi-model vs orchestration
It's a testament to the pace of change that the release of team orchestration for Claude Code called into question whether this post was even still relevant. Ultimately, I decided that this post still has legs because Git worktrees and custom skills are useful tools that are portable across workflows. My approach focuses on inter-model communication, which I imagine complements intra-model orchestration – though only experimentation will confirm that.
Bash exec vs tmux
An alternative approach that I considered and that seems to be popular is to use tmux to spawn a pane for each agent, letting them communicate with each other directly via CLI.
The main benefit is that you retain direct interactivity and observability over each agent. For instance, you'd be able to attach to running panes and modify the context, switch models, etc.
But with the bash-based approach, what you lose in observability and interactivity you gain in simplicity: for instance, you don't have to worry about the failure modes of programmatically sending input to tmux panes.
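For concreteness, here is roughly what the two dispatch styles look like, wrapped as functions for illustration. `codex` is the Codex CLI; the session name, prompt handling, and sleep are placeholders rather than a tested recipe.

```shell
# Bash-exec style: one-shot, non-interactive; stdout is captured directly.
ask_via_exec() {
  codex exec "$1"
}

# tmux style: keep a long-lived interactive pane, type the prompt into it,
# then scrape the pane for the reply. Session name and timing are illustrative.
ask_via_tmux() {
  tmux has-session -t codex-review 2>/dev/null || tmux new-session -d -s codex-review 'codex'
  tmux send-keys -t codex-review "$1" Enter
  sleep 5  # no structured handoff: wait and hope the reply has rendered
  tmux capture-pane -t codex-review -p
}
```

The exec variant hands you the reply as a plain string; the tmux variant hands you a live pane you can attach to, at the cost of the send/scrape plumbing the skill would otherwise have to get right.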
Once I had a working implementation that provided clear value, the next logical step was to extend the multi-agent paradigm from planning to implementation.
Automating multi-model implementation and code reviews
Shifting gears from planning to implementation, this is where Git worktrees become especially useful.
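Worktrees give each implementation run its own branch in its own directory, so an agent can churn through commits without disturbing your main checkout. A minimal, runnable demonstration in a throwaway repo (the paths and branch name are illustrative):

```shell
set -e
# Scratch repo so the commands are runnable anywhere; in real use these
# would be your project's paths and branches.
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "initial"

# A linked worktree: a second working directory sharing the same repository,
# checked out to a fresh branch for one implementation run.
git worktree add -q ../demo-open-sourcing -b feature/open-sourcing

# The agent works in ../demo-open-sourcing while this checkout stays untouched.
git worktree list

# Once the branch is merged (or abandoned), clean up the directory;
# the branch itself survives.
git worktree remove ../demo-open-sourcing
```

Because each worktree has its own working directory but shares the object store, parallel agent sessions can run side by side without cloning the repo twice.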
The skill: Iterative implementation
The skill definition can be found at: https://github.com/kareemf/claude-skills/blob/main/iterative-implementation/SKILL.md. At a high level, the skill has one model implement the plan task by task while the other reviews the resulting changes, iterating until the reviewer signs off or a cap is reached.
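As with the plan-review skill, the loop can be sketched roughly like this; the diff-review prompt, `APPROVED` marker, and cap are my assumptions, not the skill's exact contract.

```shell
# Hypothetical sketch of the implement/review loop; prompt wording and the
# APPROVED marker are illustrative assumptions.
implementation_loop() {
  local spec="$1" max_iters="${2:-10}" i verdict
  for i in $(seq 1 "$max_iters"); do
    # The implementer (e.g. Claude/Opus) works through the spec's task list,
    # committing on a dedicated worktree branch; then the reviewer inspects:
    verdict=$(codex exec "Review the diff on this branch against $spec. Reply APPROVED or list concrete issues.")
    if printf '%s' "$verdict" | grep -q APPROVED; then
      echo "Implementation approved after $i review pass(es)."
      return 0
    fi
    # Otherwise the implementer applies the feedback and the loop repeats.
  done
  return 1
}
```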
Invocation
The /iterative-implementation slash command can be pointed at a plan/spec file:
/iterative-implementation for @specs/open-sourcing.md with Claude as the implementer and Codex as the reviewer. Set max review iterations to 10. Auto-approve task breakdowns

In practice
If a spec does not have a Markdown-formatted task list, Claude/Opus 4.5 proposes a task breakdown before starting.
With the Claude/Opus 4.5 implementer, Codex/GPT 5.2 reviewer duo, I found that GPT was good about flagging dead code and stale comments that Opus would leave behind. Opus was more likely to make explicit review commits while GPT was more inclined to amend existing commits to apply feedback. I hadn’t directed the models to take one approach over the other, but I could see it being helpful to bake a preference into the skill for adding review commits rather than amending existing ones. More granular commits could make human review/triage easier, and commits can always be squashed if they prove to be too verbose.
The branching strategy that Codex / GPT 5.2-codex went with was effectively what I had in mind. Claude / Opus 4.5 needed a reminder to use the --no-ff flag – another preference that could be baked into the skill definition.
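On the --no-ff point: forcing a merge commit keeps each agent-built branch visible as a unit in history instead of disappearing into a fast-forward. A runnable illustration in a throwaway repo (branch names are arbitrary):

```shell
set -e
# Scratch repo for demonstration; `g` just injects a throwaway identity.
tmp=$(mktemp -d) && cd "$tmp"
git init -q .
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
g commit -q --allow-empty -m "initial"
git switch -q -c feature/demo
g commit -q --allow-empty -m "implement feature"
git switch -q -

# --no-ff forces a real merge commit even when a fast-forward is possible,
# so the feature branch stays visible as a unit in history.
g merge -q --no-ff -m "merge feature/demo" feature/demo
git log --oneline --graph
```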
Orchestration: putting it all together
Tying together agentic planning, implementation, and review, here is an overview of what the full process looks like:
- Planning
- Human drafts initial spec or problem statement
- Human + planning agent iterate
- Planning agent + review agent further iterate. Repeat until…
- Human approves final plan
- Implementation
- Implementation agent writes code/performs task
- Review agent provides structured feedback
- Agents iterate until the work is ready for human review, or until human intervention is needed
- Human reviews, then either reinitializes the loop with additional context or accepts
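The steps above can be condensed into a skeleton like the one below; `plan_agent`, `review_agent`, and `human_gate` are hypothetical stubs standing in for the real CLI calls and approval prompts, not actual commands.

```shell
# Orchestration skeleton; the three functions below are placeholder stubs.
plan_agent()   { echo "draft plan for: $1"; }
review_agent() { echo "APPROVED"; }   # real reviewer returns feedback or approval
human_gate()   { return 0; }          # real gate pauses for a person

orchestrate() {
  local spec="$1" plan
  plan=$(plan_agent "$spec")                       # human + planning agent iterate
  until review_agent "$plan" | grep -q APPROVED; do
    plan=$(plan_agent "$spec (revised)")           # planner/reviewer loop
  done
  human_gate "$plan" || return 1                   # human approves the final plan
  echo "implementation loop starts on: $plan"      # hand off to implementation
}

orchestrate "specs/open-sourcing.md"
```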
Differences from guided workflow
In On working with agentic models, part II, I mentioned that compacting or starting new sessions after milestones both improves output quality and reduces token usage. In this more automated process, where models can work on a sequence of tasks for extended periods of time without human intervention, the manual compacting step largely goes away. I wonder, then, if an ephemeral reviewer offsets the potential drift in quality of a long-running implementer. My anecdotal experience has been that it does pan out.