My last Spec Kit article answered a brownfield question:
How do I reverse-engineer one existing feature into usable Spec Kit artifacts?
The next question is the one I kept hearing right after that:
Should every Spec Kit phase use the same model?
My working hypothesis was:
- use a fast model for `constitution`, `specify`, and `clarify`
- use a stronger reasoning model for `plan`, `tasks`, and `analyze`
- use a stronger coding model for `implement`
That part still holds up.
But when I rechecked the current vendor docs on April 14, 2026, the exact stack I had in mind needed clearer naming and role separation:
- Google’s current lineup makes the split clearer: Gemini 3 Flash Preview for the newer fast slot and Gemini 3.1 Pro Preview for Google’s Pro-class, software-engineering-oriented work
- OpenAI’s current public naming is GPT-5.4 or GPT-5.4 pro with a reasoning effort setting like `xhigh`, not a model literally called `gpt-5.4-xhigh`
- Anthropic’s current docs describe Claude Opus 4.6 and Claude Sonnet 4.6, and Claude Code documents an `opusplan` mode that uses Opus for planning and Sonnet for execution
So this article is not “I proved my exact original hypothesis was perfectly right.”
It is a more useful follow-up:
the phase-splitting idea is strong, but the exact model mix should be updated for 2026.
TL;DR
- Yes, I think phase-specific model routing is the right mental model for GitHub Spec Kit.
- No, I would not keep the original stack exactly as written.
- My current default stack is:
  - `constitution`, `specify`, `clarify` with Gemini 3 Flash Preview
  - `plan`, `tasks`, `analyze` with GPT-5.4 at `high` or `xhigh` effort
  - `implement` with Claude Sonnet 4.6 by default, escalating to Claude Opus 4.6 when the implementation is unusually architecture-sensitive
- If you want a cleaner Anthropic-native option, Claude Code’s documented `opusplan` split is one of the strongest alternatives:
  - Opus for planning
  - Sonnet for execution
- The most important Google nuance is simple:
- use Gemini 3 Flash Preview for the fast slot
- use Gemini 3.1 Pro Preview when you want Google’s Pro-class option
Who This Is For
This article is for teams that:
- already use, or want to use, GitHub Spec Kit as a real workflow instead of a one-off prompt
- are comfortable working in Cursor or another model-switching environment
- care enough about planning quality to tolerate some operational complexity
- want a stronger answer than “just use the same model for everything”
This article is probably not for teams that:
- want the fewest possible moving parts
- do not want to manage multiple providers, billing paths, or model-specific behavior
- are still early enough in adoption that one strong default model would be easier to socialize
If that is your situation, skip ahead to Alternative 3: Maximum simplicity.
The Recommendation in One View
If you only want the answer before the evidence, this is the version I would hand to an engineering team today:
| Spec Kit phase | My 2026 default | Why |
|---|---|---|
| `constitution` | Gemini 3 Flash Preview | Fast repo-level synthesis |
| `specify` | Gemini 3 Flash Preview | Fast first draft of feature intent |
| `clarify` | Gemini 3 Flash Preview | Quick ambiguity reduction |
| `plan` | GPT-5.4 (`high` or `xhigh`) | Better architecture reasoning |
| `tasks` | GPT-5.4 (`high` or `xhigh`) | Better decomposition and dependency quality |
| `analyze` | GPT-5.4 (`high` or `xhigh`) | Better cross-artifact consistency checking |
| `implement` | Claude Sonnet 4.6 | Better speed-to-quality default for code execution |
| `implement` (hard cases) | Claude Opus 4.6 | Premium option for harder coding and repair loops |
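As a sketch, that table collapses into a small routing map. Everything below is illustrative: the phase keys match Spec Kit's command names, but the model strings are stand-ins for whatever identifiers your providers actually use, not verified API ids.

```python
# Hypothetical phase-to-model routing map based on the table above.
# Model strings are illustrative labels, not verified provider API ids.
DEFAULT_ROUTES = {
    "constitution": ("gemini-3-flash-preview", None),
    "specify": ("gemini-3-flash-preview", None),
    "clarify": ("gemini-3-flash-preview", None),
    "plan": ("gpt-5.4", "high"),
    "tasks": ("gpt-5.4", "high"),
    "analyze": ("gpt-5.4", "high"),
    "implement": ("claude-sonnet-4.6", None),
}

def route(phase: str, hard_case: bool = False):
    """Return (model, reasoning_effort) for a Spec Kit phase."""
    if phase == "implement" and hard_case:
        # Escalation path for architecture-sensitive implementation work.
        return ("claude-opus-4.6", None)
    return DEFAULT_ROUTES[phase]
```

The point of writing it down like this is that the escalation rule becomes explicit instead of living in one engineer's head.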
What I Can Actually Prove Here
I want to be careful with the word “proof.”
What I can prove from the current docs is:
- Spec Kit phases are distinct enough that routing them differently is operationally reasonable
- Cursor supports model-aware modes, which makes per-phase routing practical inside one IDE workflow
- Google positions Gemini 3 Flash as a fast 3-series model with Pro-level intelligence at Flash speed/pricing
- Google positions Gemini 3.1 Pro Preview around software engineering and agentic workflows
- OpenAI positions GPT-5.4 around deeper reasoning and complex professional work
- Anthropic explicitly productized a planner/executor split in Claude Code with `opusplan`
What I did not do here is run a lab-grade benchmark with identical prompts, fixed repositories, and scored outputs across all models.
So treat this article as source-backed workflow research, not as a leaderboard benchmark paper.
Why a Single Model for Every Phase Is Usually the Wrong Default
GitHub Spec Kit phases do not ask the model to do the same kind of work.
`/speckit.constitution`, `/speckit.specify`, and `/speckit.clarify` are usually about:
- extracting invariants
- turning vague intent into structured requirements
- spotting ambiguity quickly
- producing clean artifacts without overthinking every line
`/speckit.plan`, `/speckit.tasks`, and `/speckit.analyze` are heavier:
- architecture tradeoffs
- dependency ordering
- coverage gaps
- internal consistency across multiple artifacts
`/speckit.implement` is different again:
- code generation
- file-level changes
- test updates
- repair loops
- local tool interaction
That is why one-model-for-everything often feels either:
- too slow and expensive early, or
- too shallow later
The docs from the vendors line up with that intuition more than I expected.
What the Official Docs Support
1. Spec Kit already separates the workflow into phases
The current Spec Kit README documents the core flow as:
- `/speckit.constitution`
- `/speckit.specify`
- `/speckit.plan`
- `/speckit.tasks`
- `/speckit.implement`
It also documents optional commands like:
- `/speckit.clarify`
- `/speckit.analyze`
- `/speckit.checklist`
That matters because model routing only makes sense if the workflow is already phase-based. Spec Kit is.
2. Cursor makes phase routing operationally plausible
Cursor’s docs on modes and model selection are the missing operational layer here.
The current docs say:
- custom modes can have their own model
- custom modes can have their own instructions
- once you find prompt and model combinations that work well, you can save them as Custom Modes
That is basically the infrastructure you need for:
- a `Spec Draft` mode
- a `Deep Plan` mode
- an `Implement` mode
One important caveat from Cursor’s API key docs: provider availability depends on the exact API-key path and setup you use. So if your model-routing design depends on bring-your-own-provider behavior, verify your actual Cursor setup before standardizing on it team-wide.
3. Google’s current docs support both the fast Gemini 3 Flash path and the Pro-class Gemini 3.1 path
This was the part of the stack that needed the most naming clarity.
On April 14, 2026, Google’s official model catalog says:
- Gemini 3 Flash is Google’s latest 3-series Flash model, with Pro-level intelligence at the speed and pricing of Flash
- Gemini 3.1 Pro Preview provides better thinking, improved token efficiency, and a more grounded experience, and Google says it is optimized for software engineering and agentic workflows
At the same time, Google’s current model descriptions say:
- Gemini 3 Flash Preview is the better fit when the early phases need fast drafting, quick ambiguity reduction, and a newer fast frontier model from Google
- Gemini 3.1 Pro Preview is the better fit when you want Google’s heavier Pro-class option for harder clarification, repo synthesis, or other software-engineering-heavy passes
That is strong evidence for the broader workflow idea:
- yes, use Google for fast early-phase drafting if you like that part of the stack
- yes, if you specifically want Google’s Pro-class option, use Gemini 3.1 Pro Preview
4. OpenAI’s docs strongly support the “deep planning model” part
OpenAI’s current model docs say:
- GPT-5.4 is the best intelligence at scale for agentic, coding, and professional workflows
- GPT-5.4 pro is a more precise version that can think harder and may take several minutes on harder requests
- `xhigh` is a supported reasoning effort, not a separate model family
That maps well to:
- `plan`
- `tasks`
- `analyze`
These phases are not just about generating text. They are about holding more constraints in mind at once and producing internally consistent artifacts.
This is also the place where I would spend more latency budget on purpose.
5. Anthropic gives the strongest evidence for a planner/executor split
Anthropic’s current docs gave me the most interesting confirmation.
Their public Claude 4.6 docs describe:
- Claude Opus 4.6 as their most intelligent model for building agents and coding
- Claude Sonnet 4.6 as the best combination of speed and intelligence
But Claude Code’s own model configuration docs go further and document `opusplan`:
- use Opus during plan mode
- switch to Sonnet for execution
That is a real productized version of the same idea I was exploring manually.
Anthropic’s Claude Code GitHub Actions docs reinforce the same pattern in a simpler way:
- GitHub Actions default to Sonnet
- you explicitly opt into Opus 4.6 when you want the heavier model
So if you wanted one sentence of “proof” that multi-model phase routing is not a weird personal habit, this is probably the strongest one:
Anthropic already ships a planning/execution split as a named mode.
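As a sketch, the `opusplan` behavior reduces to a tiny dispatcher: one model while planning, another while executing. The model labels below are illustrative stand-ins based on the names in this article, not verified API identifiers.

```python
# Sketch of the planner/executor split behind an "opusplan"-style mode.
# Model labels are illustrative, not verified API identifiers.
def opusplan_model(in_plan_mode: bool) -> str:
    # Stronger model while planning, faster still-strong model while executing.
    return "claude-opus-4.6" if in_plan_mode else "claude-sonnet-4.6"
```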
The Operational Tax of Doing This
This is the part that can get lost if the article only talks about model quality.
Routing phases across providers is not free. It adds:
- more API-key and provider setup
- more billing surfaces to monitor
- more context handoff between phases
- more workflow discipline so engineers know when to switch models
- more debugging overhead when results are bad and the team has to ask whether the issue was the prompt, the phase, the model, or the handoff
That means I would only recommend a cross-provider stack like this when at least one of these is true:
- planning quality is expensive to get wrong
- the team already uses multiple providers
- early-phase speed matters enough to justify the extra setup
If none of that is true, a simpler one-provider workflow will often beat a clever multi-model workflow in real life.
My Updated Recommendation
If I were standardizing this for a team today, I would update the original hypothesis to this:
Stage 1: Constitution, Specify, Clarify
Default:
- Gemini 3 Flash Preview
Why:
- fast enough to keep iteration tight
- positioned by Google as a newer 3-series Flash model with Pro-level intelligence at Flash speed/pricing
- good fit for turning repo evidence and product intent into clean first-draft artifacts
When I would go lighter:
- use a lighter Flash-class option such as Gemini 2.5 Flash-Lite when cost and speed matter more than nuance
When I would go heavier:
- use Gemini 3.1 Pro Preview if you want Google’s newer Pro-class option for harder clarification or repo synthesis passes
- use GPT-5.4 mini or Claude Sonnet 4.6 if you want to stay closer to the provider you already trust operationally
Stage 2: Plan, Tasks, Analyze
Default:
- GPT-5.4 with `high` or `xhigh` reasoning effort
Escalation:
- GPT-5.4 pro when the architecture is unusually expensive to get wrong and extra latency is acceptable
Why:
- these are the phases where deeper reasoning usually pays for itself
- this is where artifact consistency matters most
- this is where weak reasoning shows up later as rework, missing dependencies, or broken acceptance logic
One subtle but important correction:
- I would describe this setup as GPT-5.4 at `xhigh` effort
- I would not describe it as a model called `gpt-5.4-xhigh`
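A minimal sketch of what that correction means in request terms: effort rides along as a parameter, and the model name stays plain. The payload shape loosely follows a Responses-style API, and both `gpt-5.4` and the `xhigh` value are the names this article uses, not values verified against any SDK.

```python
# Sketch: reasoning effort is a request parameter, not part of the model name.
# Payload shape and accepted effort values are assumptions for illustration.
def build_planning_request(prompt: str, effort: str = "high") -> dict:
    if effort not in {"low", "medium", "high", "xhigh"}:
        raise ValueError(f"unsupported reasoning effort: {effort}")
    return {
        "model": "gpt-5.4",               # plain model name, no "-xhigh" suffix
        "reasoning": {"effort": effort},  # effort set here instead
        "input": prompt,
    }
```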
Stage 3: Implement
Default:
- Claude Sonnet 4.6
Escalation:
- Claude Opus 4.6 when the change is especially risky, architecture-heavy, or likely to require more autonomous repair loops
Why I changed my mind here:
- my original instinct was “use Opus for implementation because it is the strongest coding model”
- but Anthropic’s own Claude Code docs suggest a more nuanced operating model:
- Opus for planning
- Sonnet for execution
- Anthropic’s GitHub Actions docs also default automated execution to Sonnet unless you explicitly switch to Opus 4.6
That does not mean Opus is a bad implementation model.
It means I would not treat Opus as the universal default executor anymore. I would treat it as the premium escalation path for harder codegen and harder repair loops.
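One way to make "premium escalation path" concrete is a small risk-scoring helper. The signals, weights, and threshold below are my own illustrative assumptions, not a documented heuristic from any vendor.

```python
# Illustrative escalation policy for the implement phase.
# Signals, weights, and the threshold are assumptions, not a documented rule.
RISK_SIGNALS = {
    "touches_shared_schema": 3,
    "cross_cutting_refactor": 3,
    "expected_repair_loops": 2,
    "new_external_dependency": 1,
}

def pick_executor(signals: set) -> str:
    score = sum(RISK_SIGNALS.get(s, 0) for s in signals)
    # Default executor stays Sonnet; escalate to Opus past a risk threshold.
    return "claude-opus-4.6" if score >= 3 else "claude-sonnet-4.6"
```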
How I Would Run This in Practice
Inside Cursor, I would make the routing visible instead of keeping it in my head.
I would create three modes: Spec Draft, Deep Plan, and Implement.
Before detailing each mode, here is one concrete example.
Imagine I am working on a brownfield SaaS repo and want to document, then extend, an existing invoice export flow:
- I would use Gemini 3 Flash Preview for `constitution`, `specify`, and `clarify` so I can quickly turn repo evidence and product notes into clean artifacts.
- I would switch to GPT-5.4 at `high` or `xhigh` for `plan`, `tasks`, and `analyze` so the architecture, dependency ordering, and consistency checks are stronger.
- I would hand `implement` to Claude Sonnet 4.6 for the default coding pass and only escalate to Claude Opus 4.6 if the implementation starts touching riskier cross-cutting areas than expected.
That is the kind of workflow I mean throughout this article: not model-switching for its own sake, but deliberate switching because the work itself changes shape.
1. Spec Draft
Use for:
- `/speckit.constitution`
- `/speckit.specify`
- `/speckit.clarify`
Instruction focus:
- stay grounded in repo evidence
- optimize for clarity and flow
- do not over-engineer
2. Deep Plan
Use for:
- `/speckit.plan`
- `/speckit.tasks`
- `/speckit.analyze`
Instruction focus:
- optimize for internal consistency
- preserve requirements exactly
- expose risks and dependencies explicitly
3. Implement
Use for:
- `/speckit.implement`
- repair loops after tests or runtime failures
Instruction focus:
- make minimal, verifiable code changes
- keep implementation aligned to `spec.md`, `plan.md`, and `tasks.md`
- prefer passing tests and small diffs over cleverness
The important part is not the labels. The important part is that engineers can see:
- which phase they are in
- which model is expected there
- why that model was chosen
That lowers drift a lot.
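If it helps to see the three modes as data rather than as UI configuration (Cursor modes are set up in its settings, not in code), here is a reference sketch. The mode names match the ones above; the model labels are illustrative stand-ins, not verified identifiers.

```python
# Reference sketch of the three custom modes as data. Model labels are
# illustrative; in practice each mode is configured in Cursor's settings.
MODES = {
    "Spec Draft": {
        "model": "gemini-3-flash-preview",
        "phases": ["constitution", "specify", "clarify"],
    },
    "Deep Plan": {
        "model": "gpt-5.4",
        "reasoning_effort": "high",
        "phases": ["plan", "tasks", "analyze"],
    },
    "Implement": {
        "model": "claude-sonnet-4.6",
        "phases": ["implement"],
    },
}

def mode_for(phase: str) -> str:
    """Return the mode name an engineer should be in for a given phase."""
    return next(name for name, m in MODES.items() if phase in m["phases"])
```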
Alternatives I Would Seriously Consider
The goal is not to force one exact stack. The goal is to choose a stack that matches your team’s priorities.
If I had to reduce the decision down to one quick rubric, it would be:
| If your priority is… | Start with… |
|---|---|
| lowest workflow complexity | Alternative 3: Maximum simplicity |
| easiest vendor consistency | Alternative 1: Anthropic-first |
| strongest planning depth | Alternative 2: OpenAI-heavy |
| latest Google Pro-class option | Alternative 5: Google-first updated |
| fastest early-phase drafting with deeper planning later | the default split in this article |
Alternative 1: Anthropic-first and simpler
- `constitution`, `specify`, `clarify` with Claude Sonnet 4.6
- `plan` with Claude Opus 4.6
- `implement` with Claude Sonnet 4.6
Best when:
- you want less provider switching
- you already use Claude Code heavily
- you want something close to Anthropic’s documented `opusplan` philosophy
Alternative 2: OpenAI-heavy and planning-centric
- early phases with GPT-5.4 mini
- planning phases with GPT-5.4
- hardest planning cases with GPT-5.4 pro
Best when:
- you want to stay closer to one provider
- your biggest pain is planning quality, not raw implementation speed
- you are comfortable paying more latency for better artifact quality
Alternative 3: Maximum simplicity
- one strong general model for every phase
Best when:
- your team will not actually maintain a multi-model workflow
- operational simplicity matters more than squeezing out phase-by-phase gains
This is the alternative I would choose for most teams before I choose a clever but brittle orchestration.
Alternative 4: OpenAI coding alternative outside the Cursor default
If you are building a broader coding workflow beyond just Cursor model selection, OpenAI’s current docs also position GPT-5.3-Codex as their most capable agentic coding model.
That makes it a real alternative for the execution side if your stack leans OpenAI more than Anthropic.
I would still separate:
- drafting
- planning
- implementation
But the implementation slot does not have to belong to Claude if your tooling and costs point elsewhere.
Alternative 5: Google-first updated
- `constitution`, `specify`, `clarify` with Gemini 3 Flash Preview
- harder clarification or planning passes with Gemini 3.1 Pro Preview
- `implement` with your preferred executor, usually Claude Sonnet 4.6 or another coding-first model
Best when:
- you want to stay closer to Google’s current model family
- you specifically want Google’s current Pro-class model in the stack
- you like the default article logic of fast early drafting, but want a newer Google Pro-class escalation path
What Changed My Mind Most
Three things:
1. The hypothesis was right at the workflow level
The docs support the larger pattern:
- fast model early
- deeper model for planning
- coding model for execution
That part is stronger after the review, not weaker.
2. The exact Google role mattered more than I expected
This was the sharpest clarification.
The useful Google split is not “one Gemini model everywhere.” It is:
- Gemini 3 Flash Preview for speed-sensitive drafting
- Gemini 3.1 Pro Preview for heavier Google-native passes
That is exactly why I now like dating these articles explicitly.
3. Anthropic’s `opusplan` is the cleanest alternative to manual routing
I expected to end up with a more custom conclusion.
Instead, Anthropic’s docs point to something simpler:
- use the stronger model for planning
- use the faster still-strong model for execution
That is a very good default pattern even if you never copy my exact stack.
My Final Take
If you want the shortest possible answer, it is this:
- the idea behind the original hypothesis is correct
- the exact model names should be updated for 2026
So the version I would actually hand to an engineering team today is:
- Gemini 3 Flash Preview for `constitution`, `specify`, and `clarify`
- GPT-5.4 at `high` or `xhigh` for `plan`, `tasks`, and `analyze`
- Claude Sonnet 4.6 for default `implement`, with Claude Opus 4.6 reserved for harder implementation cases
If the team wants the newest Google Pro-class option in the stack, I would use Gemini 3.1 Pro Preview alongside Gemini 3 Flash Preview rather than collapsing the whole Google side into one model.
And if the team wants something simpler and more vendor-consistent, I would look hard at an Anthropic-first setup built around the same principle Claude Code already documents in `opusplan`.
That is probably the most important lesson from this research:
the winning move is not one magical model. It is matching the model to the phase.
Source List
- GitHub Spec Kit README
- Cursor Docs: Modes
- Cursor Docs: Selecting Models
- Cursor Docs: API Keys
- Google AI Docs: Models
- Google AI Docs: Gemini 3
- Google AI Docs: Gemini 3 Flash Preview
- Google AI Docs: Gemini 3.1 Pro Preview
- OpenAI Models Overview
- OpenAI GPT-5.4
- OpenAI GPT-5.4 pro
- OpenAI GPT-5.3-Codex
- Claude Code Docs: Model configuration
- Claude API Docs: What’s new in Claude 4.6
- Claude Code Docs: GitHub Actions