My last Spec Kit article answered a brownfield question:
How do I reverse-engineer one existing feature into usable Spec Kit artifacts?
The next question is the one I kept hearing right after that:
Should every Spec Kit phase use the same model?
My working hypothesis was:
- use a fast model for `constitution`, `specify`, and `clarify`
- use a stronger reasoning model for `plan`, `tasks`, and `analyze`
- use a stronger coding model for `implement`
That part still holds up.
But when I rechecked the current vendor docs on April 14, 2026, the exact stack I had in mind needed clearer naming and role separation:
- Google’s current lineup makes the split clearer: Gemini 3 Flash Preview for the newer fast slot and Gemini 3.1 Pro Preview for Google’s heavier Pro-class work
- OpenAI’s current public naming is GPT-5.4 or GPT-5.4 pro with a reasoning effort setting like `xhigh`, not a model literally called `gpt-5.4-xhigh`
- Anthropic’s current docs describe Claude Opus 4.6 and Claude Sonnet 4.6, and Claude Code documents an `opusplan` mode that uses Opus for planning and Sonnet for execution
So this article is not “I proved my exact original hypothesis was perfectly right.”
It is a more useful follow-up:
the phase-splitting idea is strong, but the exact model mix should be updated for 2026.
TL;DR
- Yes, I still think phase-specific model routing is the right mental model for GitHub Spec Kit.
- No, I no longer think the best article is just “fast model here, smart model there, coding model there.”
- The more useful question is: what breaks first when you actually run this workflow?
- The common complaints are surprisingly consistent:
  - Gemini can feel too creative unless you ground it hard
  - GPT-5.4 can become too slow if you let it reason deeply on every loop
  - Claude Opus is powerful, but often too slow to be the default executor
- The most interesting new variable is Composer 2 Fast inside Cursor:
  - Cursor says Composer 2 has frontier-level coding performance
  - Cursor says the faster variant has the same intelligence, lower cost than other fast models, and is now the default fast option
- My current practical default stack is:
  - `constitution`, `specify`, `clarify` with Composer 2 Fast inside Cursor
  - `plan`, `tasks`, `analyze` with GPT-5.4 at `medium` or `high`, not `xhigh` by default
  - `implement` with Claude Sonnet 4.6 by default, escalating to Claude Opus 4.6 only when the code path is unusually hard
- I still think Gemini 3 Flash Preview and Gemini 3.1 Pro Preview are useful, but I now see them as grounded Google paths, not my default answer for every fast loop
Who This Is For
This article is for teams that:
- already use, or want to use, GitHub Spec Kit as a real workflow instead of a one-off prompt
- are comfortable working in Cursor or another model-switching environment
- care enough about planning quality to tolerate some operational complexity
- want a stronger answer than “just use the same model for everything”
This article is probably not for teams that:
- want the fewest possible moving parts
- do not want to manage multiple providers, billing paths, or model-specific behavior
- are still early enough in adoption that one strong default model would be easier to socialize
If that is your situation, skip ahead to Alternative 4: Maximum simplicity.
What Keeps Breaking in Practice
This is the section I think the earlier version underplayed.
Theoretical strengths matter, but recurring failure modes matter more.
1. Gemini can drift unless you ground it hard
This is the complaint I hear most often about Gemini in spec-first workflows.
Not “Gemini is useless.”
More like:
- the draft is fast
- the prose sounds good
- but some of it is too confident relative to the repo evidence
Google’s own Gemini API docs are a useful reminder here. Their safety guidance explicitly says large language models can generate text that is factually incorrect. Their grounding docs say grounding with Google Search can reduce hallucinations and improve factual accuracy. Their structured outputs docs also warn that even schema-valid output may still fail your business logic checks.
So my updated read is:
- Gemini is still attractive for speed
- but it is safest when you force grounding, evidence-driven prompts, and post-generation validation
That means I no longer love Gemini as an unguarded default for final artifact drafting.
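One cheap form of post-generation validation is to check that every file a draft mentions actually exists in the repo. The sketch below is illustrative only: `find_unbacked_paths` is a hypothetical helper, it assumes the draft cites file paths in backticks, and it assumes you supply the set of repo paths yourself (for example from the output of `git ls-files`). The paths in the sample draft are made up.

```python
import re

# Assumption: the draft wraps file paths in backticks, e.g. `src/billing/export.py`.
PATH_PATTERN = re.compile(r"`([\w./-]+\.\w+)`")


def find_unbacked_paths(draft: str, repo_paths: set[str]) -> list[str]:
    """Return paths cited in the draft that are missing from the repo listing."""
    cited = PATH_PATTERN.findall(draft)
    # Keep first-seen order while dropping duplicate citations.
    unique = list(dict.fromkeys(cited))
    return [path for path in unique if path not in repo_paths]


# Hypothetical draft text and repo listing, for illustration only.
draft = (
    "The export flow starts in `src/billing/export.py` and writes through "
    "`src/billing/csv_writer.py`. Retries are configured in `config/retry.yaml`."
)
repo_paths = {"src/billing/export.py", "config/retry.yaml"}

missing = find_unbacked_paths(draft, repo_paths)
print(missing)
```

A non-empty result does not prove the draft is wrong, but it is a useful signal that some claim needs repo evidence before the artifact is accepted.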
2. GPT-5.4 can reason too much for inner loops
I still like GPT-5.4 a lot for planning.
But the common complaint is also real:
- if you use GPT-5.4 on every draft loop
- or if you jump straight to `high`/`xhigh` too often
- the workflow can become noticeably slower than it needs to be
OpenAI’s current docs support that tradeoff. They position GPT-5.4 as the flagship for complex reasoning and coding, while also explicitly recommending GPT-5.4 mini and GPT-5.4 nano for lower-latency, lower-cost workloads. They also note that GPT-5.4 pro may take several minutes on harder requests.
So the issue is not “GPT-5.4 is bad.”
The issue is:
- it is excellent for the checkpoint
- it is often overkill for the inner loop
3. Claude Opus is too slow to be the default executor
Anthropic’s docs line up almost perfectly with the complaint here.
They describe:
- Claude Opus 4.6 as the most intelligent model for agents and coding
- Claude Sonnet 4.6 as the best combination of speed and intelligence
Claude Code also documents:
- `opusplan`, which uses Opus for planning and Sonnet for execution
- fast mode for Opus 4.6
That is a strong signal.
If Anthropic itself ships:
- Opus for the hard reasoning moments
- Sonnet for daily execution
then I do not think Opus should be the default implementation model in this article anymore.
4. Composer 2 Fast belongs in this conversation now
This is the biggest addition to the article.
Cursor’s current Composer 2 materials say:
- Composer 2 has frontier-level coding performance
- Composer 2 scored 61.3 on CursorBench, 61.7 on Terminal-Bench 2.0, and 73.7 on SWE-bench Multilingual
- the faster variant has the same intelligence
- it has lower cost than other fast models
- Cursor is making the fast variant the default option
That does not automatically make Composer 2 Fast the best model for every Spec Kit phase.
But it absolutely makes it worth analyzing as:
- the fastest coding-native loop inside Cursor
- a strong default for iterative drafting
- a serious alternative when GPT-5.4 and Opus feel too slow
The Recommendation in One View
If you only want the answer before the evidence, this is the version I would hand to an engineering team today:
| Spec Kit phase | My practical default | Why | Escalate to |
|---|---|---|---|
| `constitution` | Composer 2 Fast | Fast Cursor-native repo synthesis | Gemini 3.1 Pro Preview or GPT-5.4 if the repo is unusually subtle |
| `specify` | Composer 2 Fast | Fast drafting with strong coding context | Gemini 3 Flash Preview with grounding if you want Google |
| `clarify` | Composer 2 Fast | Quick iteration without GPT-5.4 latency | Gemini 3.1 Pro Preview for harder clarification passes |
| `plan` | GPT-5.4 (`medium` or `high`) | Better architecture reasoning | `xhigh` or GPT-5.4 pro only when the plan is high-stakes or stuck |
| `tasks` | GPT-5.4 (`medium` or `high`) | Better decomposition and dependency quality | `xhigh` if the breakdown is still weak |
| `analyze` | GPT-5.4 (`medium` or `high`) | Better cross-artifact consistency checking | GPT-5.4 pro for rare deep review passes |
| `implement` | Claude Sonnet 4.6 | Best daily speed/intelligence balance | Claude Opus 4.6 for hard rescues, Composer 2 Fast for quick local loops |
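The same table can be written down as data, which makes the defaults and escalation paths easy to review in a pull request. This is a minimal sketch, not a real integration: `Route`, `ROUTES`, and `pick_model` are hypothetical names, and the model strings are the display names used in this article, not verified API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    default: str      # model used on a normal pass
    escalation: str   # model to try when the default pass is not good enough


# Display names copied from the table; NOT verified API model identifiers.
ROUTES = {
    "constitution": Route("Composer 2 Fast", "Gemini 3.1 Pro Preview or GPT-5.4"),
    "specify": Route("Composer 2 Fast", "Gemini 3 Flash Preview (grounded)"),
    "clarify": Route("Composer 2 Fast", "Gemini 3.1 Pro Preview"),
    "plan": Route("GPT-5.4 (medium or high)", "GPT-5.4 (xhigh) or GPT-5.4 pro"),
    "tasks": Route("GPT-5.4 (medium or high)", "GPT-5.4 (xhigh)"),
    "analyze": Route("GPT-5.4 (medium or high)", "GPT-5.4 pro"),
    "implement": Route("Claude Sonnet 4.6", "Claude Opus 4.6"),
}


def pick_model(phase: str, escalate: bool = False) -> str:
    """Return the default model for a phase, or its escalation target."""
    if phase not in ROUTES:
        raise KeyError(f"unknown Spec Kit phase: {phase!r}")
    route = ROUTES[phase]
    return route.escalation if escalate else route.default


print(pick_model("plan"))
print(pick_model("implement", escalate=True))
```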
What I Can Actually Prove Here
I want to be careful with the word “proof.”
What I can prove from the current docs is:
- Spec Kit phases are distinct enough that routing them differently is operationally reasonable
- Google itself documents factual inaccuracy risk, and grounding/structured output mitigations
- OpenAI itself distinguishes flagship deep reasoning from lower-latency smaller variants
- Anthropic itself distinguishes Opus from Sonnet by speed/intelligence role
- Claude Code itself productizes a plan/execute split
- Cursor now has a first-party fast coding model path worth taking seriously
What I did not do here is run a controlled benchmark with identical prompts, identical repositories, and scored outputs across all providers.
So treat this article as source-backed workflow research plus practical engineering analysis, not as a lab-grade benchmark paper.
Why the Naive Phase Map Stops Being Useful
The old version of the article mostly asked:
- which model is strongest for this phase?
The more useful version asks:
- what failure mode hurts this phase most?
For example:
- `constitution`, `specify`, `clarify` suffer most from slow iteration and overconfident drafting
- `plan`, `tasks`, `analyze` suffer most from shallow reasoning and inconsistency across artifacts
- `implement` suffers most from latency during repair loops
Once you see the workflow that way, the stack changes.
What the Official Docs Support
1. Spec Kit still gives us the right phase boundaries
The current Spec Kit README documents the core flow as:
- `/speckit.constitution`
- `/speckit.specify`
- `/speckit.plan`
- `/speckit.tasks`
- `/speckit.implement`
It also documents optional commands like:
- `/speckit.clarify`
- `/speckit.analyze`
- `/speckit.checklist`
That still matters. The phases are real. The question is how aggressively you want to optimize each one.
2. Google docs support the Gemini concern and the Gemini mitigation
Google’s own Gemini API docs do not say “Gemini hallucinates too much.”
But they do say enough to make the concern real:
- Gemini output can be factually incorrect
- grounding with Google Search can reduce hallucinations
- structured output improves formatting but does not guarantee business correctness
That leads to a more careful Google recommendation:
- use Gemini 3 Flash Preview when you want the fast Google path
- use Gemini 3.1 Pro Preview when you want Google’s heavier engineering-oriented path
- but treat Gemini as much safer when the prompt is tightly grounded in repo evidence, structured output, or search grounding
3. OpenAI docs support using GPT-5.4 as a checkpoint model, not an everywhere model
OpenAI’s current docs say:
- GPT-5.4 is the flagship for complex reasoning and coding
- GPT-5.4 mini is a faster, more efficient model for high-volume workloads
- GPT-5.4 pro can take several minutes on hard requests
That maps well to:
- use GPT-5.4 for the serious planning checkpoints
- do not assume GPT-5.4 at `xhigh` belongs in every loop
This is the article change I feel best about.
The issue was never that GPT-5.4 was weak.
The issue was that I was implicitly treating a checkpoint model like an iteration model.
4. Anthropic docs support Sonnet as the default executor much more than Opus
Anthropic’s current docs say:
- Claude Sonnet 4.6 is the best combination of speed and intelligence
- Claude Opus 4.6 is the most intelligent model for agents and coding
- fast mode makes Opus 4.6 up to 2.5x faster
Claude Code adds two practical workflow signals:
- `opusplan` = Opus for planning, Sonnet for execution
- model aliases explicitly position Sonnet for daily coding tasks
So if the complaint is “Claude Opus is simply too slow,” the docs are basically pointing to the same answer:
- yes, keep Opus
- no, do not make it the default execution loop
5. Cursor’s Composer 2 Fast is the most important new addition
The biggest thing missing from the earlier version of this article was Cursor’s own fast-model path.
Cursor now says:
- Composer 2 is frontier-level at coding
- Composer 2 has strong results on CursorBench and other coding benchmarks
- the faster variant has the same intelligence
- the fast variant has lower cost than other fast models
- Cursor is making the fast variant the default
Because this workflow already assumes Cursor for a lot of Spec Kit work, Composer 2 Fast deserves a first-class role in the article.
It is not automatically better than GPT-5.4 on deep planning.
But it is extremely relevant for:
- fast constitution/spec drafting
- quick clarification loops
- quick code-edit loops when Sonnet or Opus feel too slow
The Operational Tax of Doing This
This is still the part that can get lost if the article only talks about model quality.
Routing phases across providers is not free. It adds:
- more API-key and provider setup
- more billing surfaces to monitor
- more context handoff between phases
- more workflow discipline so engineers know when to switch models
- more debugging overhead when results are bad and the team has to ask whether the issue was the prompt, the phase, the model, or the handoff
That means I would only recommend a cross-provider stack like this when at least one of these is true:
- planning quality is expensive to get wrong
- the team already uses multiple providers
- early-phase speed matters enough to justify the extra setup
If none of that is true, a simpler one-provider workflow will often beat a clever multi-model workflow in real life.
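If you do accept the tax, the debugging overhead listed above gets more manageable when every run leaves a small record of which phase, model, prompt, and handed-over artifacts were involved. The sketch below is one possible shape for that record, not a Spec Kit or Cursor feature: `RunRecord` and `record_run` are hypothetical names, and the model string is a display name rather than an API identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    phase: str                        # e.g. "plan"
    model: str                        # display name, not an API identifier
    prompt_sha: str                   # short hash so prompt changes are visible
    input_artifacts: tuple[str, ...]  # files handed over from earlier phases


def record_run(phase: str, model: str, prompt: str, input_artifacts: list[str]) -> str:
    """Serialize one run as a JSON line suitable for appending to a log file."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    record = RunRecord(phase, model, digest, tuple(input_artifacts))
    return json.dumps(asdict(record), sort_keys=True)


line = record_run("plan", "GPT-5.4 (medium)", "Draft the plan for invoice export.", ["spec.md"])
print(line)
```

When a result is bad, comparing two records shows quickly whether the phase, the model, the prompt, or the handoff changed between a good run and a bad one.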
My Updated Recommendation
If I were standardizing this for a team today, I would update the original hypothesis to this:
Stage 1: Constitution, Specify, Clarify
Default:
- Composer 2 Fast
Why:
- it is now the most compelling fast loop available directly inside Cursor
- Cursor positions it as a frontier coding model with a faster default variant
- fast drafting matters a lot more in these early phases than maximum reasoning depth
When I would use Google instead:
- use Gemini 3 Flash Preview when you want Google’s fast path
- use Gemini 3.1 Pro Preview when you want Google’s heavier engineering-oriented path
- in both cases, add stronger grounding and validation than I used to
Stage 2: Plan, Tasks, Analyze
Default:
- GPT-5.4 with `medium` or `high`
Escalation:
- use `xhigh` only when the plan is still weak after a normal pass
- use GPT-5.4 pro only when the architecture is unusually expensive to get wrong and extra latency is acceptable
Why:
- these are the phases where deeper reasoning usually pays for itself
- this is where artifact consistency matters most
- this is where weak reasoning shows up later as rework, missing dependencies, or broken acceptance logic
One subtle but important correction:
- I would describe this setup as GPT-5.4 at medium/high by default
- I would not jump to `xhigh` unless the task actually needs it
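The escalation rule for this stage is simple enough to state as code. This is a sketch under assumptions: `EFFORT_LADDER` and `next_effort` are hypothetical names, and `plan_is_acceptable` stands in for whatever review signal your team uses (a human read, a checklist, an `analyze` pass). Moving to GPT-5.4 pro is deliberately left out, since the article treats it as a separate, rarer decision.

```python
# Effort levels named in this article, ordered from cheapest to most expensive.
EFFORT_LADDER = ["medium", "high", "xhigh"]


def next_effort(current: str, plan_is_acceptable: bool) -> str:
    """Stay at the current effort if the plan is acceptable; otherwise climb one rung."""
    if current not in EFFORT_LADDER:
        raise ValueError(f"unknown effort level: {current!r}")
    if plan_is_acceptable:
        return current
    index = EFFORT_LADDER.index(current)
    # Clamp at the top rung instead of running off the end of the ladder.
    return EFFORT_LADDER[min(index + 1, len(EFFORT_LADDER) - 1)]


print(next_effort("medium", plan_is_acceptable=False))
```

The point of the ladder is that `xhigh` is only reached after a normal pass has already failed, never as the starting point.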
Stage 3: Implement
Default:
- Claude Sonnet 4.6
Secondary fast loop:
- Composer 2 Fast when I want a quick local iteration loop inside Cursor
Escalation:
- Claude Opus 4.6 when the change is especially risky, architecture-heavy, or likely to require more autonomous repair loops
Why:
- Sonnet is now the clearest “daily driver” implementation model in the official docs
- Opus is valuable, but too slow to be my default execution loop
- Composer 2 Fast is now good enough that it belongs in the practical implementation conversation too
How I Would Run This in Practice
Inside Cursor, I would stop thinking in terms of “one model per phase” and start thinking in terms of inner loops and checkpoint loops.
Here is one concrete example.
Imagine I am working on a brownfield SaaS repo and want to document, then extend, an existing invoice export flow:
- I would use Composer 2 Fast for `constitution`, `specify`, and `clarify` to iterate quickly and keep the drafting loop tight.
- If the draft starts feeling too loose or too code-assumptive, I would try Gemini 3 Flash Preview or Gemini 3.1 Pro Preview, but only with stronger grounding and evidence constraints.
- I would switch to GPT-5.4 at `medium` or `high` for `plan`, `tasks`, and `analyze` so the architecture and cross-artifact consistency checks are stronger.
- I would hand `implement` to Claude Sonnet 4.6 for the default coding pass.
- I would escalate only when needed:
  - GPT-5.4 xhigh/pro for hard planning deadlocks
  - Claude Opus 4.6 for hard implementation rescues
  - Composer 2 Fast when I want a quick local coding loop without paying the Opus latency tax
That is the kind of workflow I mean throughout this article: not model-switching for its own sake, but deliberate switching because the failure mode changes.
1. Fast Draft
Use for:
- `/speckit.constitution`
- `/speckit.specify`
- `/speckit.clarify`
Default:
- Composer 2 Fast
Instruction focus:
- stay grounded in repo evidence
- optimize for clarity and flow
- do not over-engineer
2. Deep Checkpoint
Use for:
- `/speckit.plan`
- `/speckit.tasks`
- `/speckit.analyze`
Default:
- GPT-5.4 medium/high
Instruction focus:
- optimize for internal consistency
- preserve requirements exactly
- expose risks and dependencies explicitly
3. Daily Execution
Use for:
- `/speckit.implement`
- repair loops after tests or runtime failures
Default:
- Claude Sonnet 4.6
Instruction focus:
- make minimal, verifiable code changes
- keep implementation aligned to `spec.md`, `plan.md`, and `tasks.md`
- prefer passing tests and small diffs over cleverness
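The three profiles above pair a set of commands with a default model and an instruction focus, so they can be kept as one small lookup that also produces the instruction preamble. A minimal sketch, assuming hypothetical names (`Profile`, `PROFILES`, `profile_for`, `preamble`); the model strings are display names from this article, not API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    commands: tuple[str, ...]
    default_model: str          # display name, not an API identifier
    focus: tuple[str, ...]      # instruction focus, one item per line


PROFILES = (
    Profile(
        "Fast Draft",
        ("/speckit.constitution", "/speckit.specify", "/speckit.clarify"),
        "Composer 2 Fast",
        ("stay grounded in repo evidence", "optimize for clarity and flow", "do not over-engineer"),
    ),
    Profile(
        "Deep Checkpoint",
        ("/speckit.plan", "/speckit.tasks", "/speckit.analyze"),
        "GPT-5.4 medium/high",
        ("optimize for internal consistency", "preserve requirements exactly",
         "expose risks and dependencies explicitly"),
    ),
    Profile(
        "Daily Execution",
        ("/speckit.implement",),
        "Claude Sonnet 4.6",
        ("make minimal, verifiable code changes",
         "keep implementation aligned to spec.md, plan.md, and tasks.md",
         "prefer passing tests and small diffs over cleverness"),
    ),
)


def profile_for(command: str) -> Profile:
    """Find the profile that owns a Spec Kit command."""
    for profile in PROFILES:
        if command in profile.commands:
            return profile
    raise LookupError(f"no profile covers {command!r}")


def preamble(command: str) -> str:
    """Build a short instruction preamble from the profile's focus items."""
    profile = profile_for(command)
    lines = [f"Profile: {profile.name} (default model: {profile.default_model})"]
    lines += [f"- {item}" for item in profile.focus]
    return "\n".join(lines)


print(preamble("/speckit.plan"))
```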
Alternatives I Would Seriously Consider
The goal is not to force one exact stack. The goal is to choose a stack that matches your team’s priorities.
If I had to reduce the decision down to one quick rubric, it would be:
| If your priority is… | Start with… |
|---|---|
| fastest Cursor-native loop | Alternative 1: Cursor-native speed path |
| strongest planning depth | Alternative 2: OpenAI checkpoint path |
| Google-first path with stronger grounding | Alternative 3: Grounded Gemini path |
| lowest workflow complexity | Alternative 4: Maximum simplicity |
Alternative 1: Cursor-native speed path
- `constitution`, `specify`, `clarify` with Composer 2 Fast
- `plan`, `tasks`, `analyze` with GPT-5.4 medium/high
- `implement` with Claude Sonnet 4.6
Best when:
- you live in Cursor already
- your biggest pain is slow iteration
- you want the tightest practical loop without defaulting to GPT-5.4 or Opus everywhere
Alternative 2: OpenAI checkpoint path
- fast draft with Composer 2 Fast
- planning checkpoints with GPT-5.4
- hardest planning cases with GPT-5.4 pro
- implementation with Claude Sonnet 4.6
Best when:
- your biggest pain is planning quality, not raw implementation speed
- you are comfortable paying latency only at checkpoint moments
Alternative 3: Grounded Gemini path
- `constitution`, `specify`, `clarify` with Gemini 3 Flash Preview
- heavier clarification or synthesis passes with Gemini 3.1 Pro Preview
- planning checkpoints with GPT-5.4
- implementation with Claude Sonnet 4.6
Best when:
- you want to stay closer to Google’s current model family
- you are willing to add grounding and validation guardrails
- you still value the speed of Gemini for drafting
Alternative 4: Maximum simplicity
- one strong general model for every phase
Best when:
- your team will not actually maintain a multi-model workflow
- operational simplicity matters more than squeezing out phase-by-phase gains
This is still the alternative I would choose for most teams before I choose a clever but brittle orchestration.
What Changed My Mind Most
Three things:
1. The right question is now failure-mode routing, not just phase routing
The old article was directionally right, but too clean.
The practical question is not only:
- which model is strongest for this phase?
It is also:
- which model fails in the most painful way for this phase?
2. GPT-5.4 and Opus are better as checkpoints than as defaults
This was the sharpest workflow lesson.
- GPT-5.4 is still excellent, but I want it for deep checkpoints, not every draft loop
- Opus is still excellent, but I want it for hard rescues, not every execution loop
That is a much healthier operational model.
3. Composer 2 Fast now belongs in this article
This is the biggest change in the rewrite.
Cursor’s own materials now make Composer 2 Fast too relevant to ignore:
- strong coding benchmarks
- faster default variant
- same intelligence claim for the fast option
- lower cost than other fast models
For a Spec Kit workflow that already lives heavily inside Cursor, that changes the practical recommendation.
My Final Take
If you want the shortest possible answer, it is this:
- the idea behind the original hypothesis is still correct
- but the best stack in practice changes once you account for real failure modes
So the version I would actually hand to an engineering team today is:
- Composer 2 Fast for `constitution`, `specify`, and `clarify`
- GPT-5.4 at `medium` or `high` for `plan`, `tasks`, and `analyze`
- Claude Sonnet 4.6 for default `implement`
- Gemini 3 Flash Preview or Gemini 3.1 Pro Preview only when you want a grounded Google path
- GPT-5.4 xhigh/pro and Claude Opus 4.6 only as escalation tools
That is probably the most important lesson from this research:
the winning move is not one magical model. It is matching the model to the failure mode.
Source List
- GitHub Spec Kit README
- Cursor Docs: Modes
- Cursor Docs: Selecting Models
- Cursor Docs: API Keys
- Cursor: Introducing Composer 2
- Cursor: Composer 2 Technical Report
- Google AI Docs: Models
- Google AI Docs: Gemini 3
- Google AI Docs: Gemini 3 Flash Preview
- Google AI Docs: Gemini 3.1 Pro Preview
- Google AI Docs: Grounding with Google Search
- Google AI Docs: Structured Outputs
- Google AI Docs: Safety guidance
- OpenAI Models Overview
- OpenAI Compare Models
- OpenAI GPT-5.4 pro
- Claude Code Docs: Model configuration
- Claude Code Docs: Fast mode
- Claude API Docs: What’s new in Claude 4.6
- Claude Code Docs: Changelog