AI Coding Workflows

A Phase-by-Phase Model Strategy for GitHub Spec Kit in 2026

A source-backed follow-up on routing GitHub Spec Kit phases across fast drafting models, deeper planning models, and stronger implementation models without treating all steps as the same kind of work.

17 min read · Updated Apr 14, 2026

My last Spec Kit article answered a brownfield question:

How do I reverse-engineer one existing feature into usable Spec Kit artifacts?

The next question is the one I kept hearing right after that:

Should every Spec Kit phase use the same model?

My working hypothesis was:

  • use a fast model for constitution, specify, and clarify
  • use a stronger reasoning model for plan, tasks, and analyze
  • use a stronger coding model for implement

That part still holds up.

But when I rechecked the current vendor docs on April 14, 2026, the exact stack I had in mind needed clearer naming and role separation:

  • Google’s current lineup makes the split clearer: Gemini 3 Flash Preview for the newer fast slot and Gemini 3.1 Pro Preview for Google’s Pro-class, software-engineering-oriented work
  • OpenAI’s current public naming is GPT-5.4 or GPT-5.4 pro with a reasoning effort setting like xhigh, not a model literally called gpt-5.4-xhigh
  • Anthropic’s current docs describe Claude Opus 4.6 and Claude Sonnet 4.6, and Claude Code documents an opusplan mode that uses Opus for planning and Sonnet for execution

So this article is not “I proved my exact original hypothesis was perfectly right.”

It is a more useful follow-up:

the phase-splitting idea is strong, but the exact model mix should be updated for 2026.

TL;DR

  • Yes, I think phase-specific model routing is the right mental model for GitHub Spec Kit.
  • No, I would not keep the original stack exactly as written.
  • My current default stack is:
    1. constitution, specify, clarify with Gemini 3 Flash Preview
    2. plan, tasks, analyze with GPT-5.4 at high or xhigh effort
    3. implement with Claude Sonnet 4.6 by default, escalating to Claude Opus 4.6 when the implementation is unusually architecture-sensitive
  • If you want a cleaner Anthropic-native option, Claude Code’s documented opusplan split is one of the strongest alternatives:
    1. Opus for planning
    2. Sonnet for execution
  • The most important Google nuance is simple:
    • use Gemini 3 Flash Preview for the fast slot
    • use Gemini 3.1 Pro Preview when you want Google’s Pro-class option

Who This Is For

This article is for teams that:

  • already use, or want to use, GitHub Spec Kit as a real workflow instead of a one-off prompt
  • are comfortable working in Cursor or another model-switching environment
  • care enough about planning quality to tolerate some operational complexity
  • want a stronger answer than “just use the same model for everything”

This article is probably not for teams that:

  • want the fewest possible moving parts
  • do not want to manage multiple providers, billing paths, or model-specific behavior
  • are still early enough in adoption that one strong default model would be easier to socialize

If that is your situation, skip ahead to Alternative 3: Maximum simplicity.

The Recommendation in One View

If you only want the answer before the evidence, this is the version I would hand to an engineering team today:

Spec Kit phase | My 2026 default | Why
constitution | Gemini 3 Flash Preview | Fast repo-level synthesis
specify | Gemini 3 Flash Preview | Fast first draft of feature intent
clarify | Gemini 3 Flash Preview | Quick ambiguity reduction
plan | GPT-5.4 (high or xhigh) | Better architecture reasoning
tasks | GPT-5.4 (high or xhigh) | Better decomposition and dependency quality
analyze | GPT-5.4 (high or xhigh) | Better cross-artifact consistency checking
implement | Claude Sonnet 4.6 | Better speed-to-quality default for code execution
implement for hard cases | Claude Opus 4.6 | Premium option for harder coding and repair loops
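The table above can also live in code, so a team has one place to update when model names change. This is a sketch that mirrors the table exactly; the model strings are the article's display labels, not verified API model IDs.

```python
# Illustrative phase-to-model routing for the Spec Kit workflow above.
# Model names are display labels from this article, not exact API model IDs.
DEFAULT_ROUTING = {
    "constitution": "Gemini 3 Flash Preview",
    "specify": "Gemini 3 Flash Preview",
    "clarify": "Gemini 3 Flash Preview",
    "plan": "GPT-5.4 (high or xhigh)",
    "tasks": "GPT-5.4 (high or xhigh)",
    "analyze": "GPT-5.4 (high or xhigh)",
    "implement": "Claude Sonnet 4.6",
}

# Escalation path for unusually architecture-sensitive implementation work.
ESCALATIONS = {"implement": "Claude Opus 4.6"}

def model_for(phase: str, hard_case: bool = False) -> str:
    """Return the default model for a Spec Kit phase, honoring escalations."""
    if hard_case and phase in ESCALATIONS:
        return ESCALATIONS[phase]
    return DEFAULT_ROUTING[phase]
```

Keeping the routing as data rather than tribal knowledge is most of the point: when a vendor renames a model, the change is one diff instead of a team-wide announcement.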

What I Can Actually Prove Here

I want to be careful with the word “proof.”

What I can prove from the current docs is:

  • Spec Kit phases are distinct enough that routing them differently is operationally reasonable
  • Cursor supports model-aware modes, which makes per-phase routing practical inside one IDE workflow
  • Google positions Gemini 3 Flash as a fast 3-series model with Pro-level intelligence at Flash speed/pricing
  • Google positions Gemini 3.1 Pro Preview around software engineering and agentic workflows
  • OpenAI positions GPT-5.4 around deeper reasoning and complex professional work
  • Anthropic explicitly productized a planner/executor split in Claude Code with opusplan

What I did not do here is run a lab-grade benchmark with identical prompts, fixed repositories, and scored outputs across all models.

So treat this article as source-backed workflow research, not as a leaderboard benchmark paper.

Why a Single Model for Every Phase Is Usually the Wrong Default

GitHub Spec Kit phases do not ask the model to do the same kind of work.

/speckit.constitution, /speckit.specify, and /speckit.clarify are usually about:

  • extracting invariants
  • turning vague intent into structured requirements
  • spotting ambiguity quickly
  • producing clean artifacts without overthinking every line

/speckit.plan, /speckit.tasks, and /speckit.analyze are heavier:

  • architecture tradeoffs
  • dependency ordering
  • coverage gaps
  • internal consistency across multiple artifacts

/speckit.implement is different again:

  • code generation
  • file-level changes
  • test updates
  • repair loops
  • local tool interaction

That is why one-model-for-everything often feels either:

  • too slow and expensive early, or
  • too shallow later

The docs from the vendors line up with that intuition more than I expected.

What the Official Docs Support

1. Spec Kit already separates the workflow into phases

The current Spec Kit README documents the core flow as:

  • /speckit.constitution
  • /speckit.specify
  • /speckit.plan
  • /speckit.tasks
  • /speckit.implement

It also documents optional commands like:

  • /speckit.clarify
  • /speckit.analyze
  • /speckit.checklist

That matters because model routing only makes sense if the workflow is already phase-based. Spec Kit is.

2. Cursor makes phase routing operationally plausible

Cursor’s docs on modes and model selection are the missing operational layer here.

The current docs say:

  • custom modes can have their own model
  • custom modes can have their own instructions
  • once you find prompt and model combinations that work well, you can save them as Custom Modes

That is basically the infrastructure you need for:

  • a Spec Draft mode
  • a Deep Plan mode
  • an Implement mode

One important caveat from Cursor’s API key docs: provider availability depends on the exact API-key path and setup you use. So if your model-routing design depends on bring-your-own-provider behavior, verify your actual Cursor setup before standardizing on it team-wide.
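Cursor configures custom modes through its UI, so there is no canonical file format to show here. As a hedged sketch, though, a team could keep a checked-in record of the intended modes so everyone can see which model and instructions each mode is supposed to carry; the mode names and labels below are this article's, not anything Cursor defines.

```python
from dataclasses import dataclass

# Hypothetical, checked-in documentation of the Cursor custom modes this
# article proposes. Cursor itself is configured through its UI; this record
# just keeps the intended model and instructions visible to the team.
@dataclass(frozen=True)
class CursorMode:
    name: str
    model: str                 # display label, not an exact API model ID
    commands: tuple            # Spec Kit commands this mode covers
    instructions: str

MODES = {
    "spec-draft": CursorMode(
        name="Spec Draft",
        model="Gemini 3 Flash Preview",
        commands=("/speckit.constitution", "/speckit.specify", "/speckit.clarify"),
        instructions="Stay grounded in repo evidence; optimize for clarity.",
    ),
    "deep-plan": CursorMode(
        name="Deep Plan",
        model="GPT-5.4 (high/xhigh effort)",
        commands=("/speckit.plan", "/speckit.tasks", "/speckit.analyze"),
        instructions="Optimize for internal consistency; expose risks explicitly.",
    ),
    "implement": CursorMode(
        name="Implement",
        model="Claude Sonnet 4.6",
        commands=("/speckit.implement",),
        instructions="Minimal, verifiable changes; small diffs over cleverness.",
    ),
}
```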

3. Google’s current docs support both the fast Gemini 3 Flash path and the Pro-class Gemini 3.1 path

This was the part of the stack that needed the most naming clarity.

On April 14, 2026, Google’s official model catalog says:

  • Gemini 3 Flash is Google’s latest 3-series Flash model, with Pro-level intelligence at the speed and pricing of Flash
  • Gemini 3.1 Pro Preview provides better thinking, improved token efficiency, and a more grounded experience, and Google says it is optimized for software engineering and agentic workflows

Read together, those descriptions map onto the workflow like this:

  • Gemini 3 Flash Preview is the better fit when the early phases need fast drafting, quick ambiguity reduction, and a newer fast frontier model from Google
  • Gemini 3.1 Pro Preview is the better fit when you want Google’s heavier Pro-class option for harder clarification, repo synthesis, or other software-engineering-heavy passes

That is strong evidence for the broader workflow idea:

  • yes, use Google for fast early-phase drafting if you like that part of the stack
  • yes, if you specifically want Google’s Pro-class option, use Gemini 3.1 Pro Preview

4. OpenAI’s docs strongly support the “deep planning model” part

OpenAI’s current model docs say:

  • GPT-5.4 is the best intelligence at scale for agentic, coding, and professional workflows
  • GPT-5.4 pro is a more precise version that can think harder and may take several minutes on harder requests
  • xhigh is a supported reasoning effort, not a separate model family

That maps well to:

  • plan
  • tasks
  • analyze

These phases are not just about generating text. They are about holding more constraints in mind at once and producing internally consistent artifacts.

This is also the place where I would spend more latency budget on purpose.
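The "effort is a setting, not a model name" point is easy to encode. Here is a minimal sketch, assuming an OpenAI Responses-style API shape; the model ID "gpt-5.4" and the "xhigh" effort value come from this article's premise, not from verified identifiers.

```python
# A minimal sketch of "effort is a setting, not a model name", assuming an
# OpenAI Responses-style API. The model ID "gpt-5.4" and the "xhigh" effort
# value are taken from this article's premise, not verified identifiers.
def planning_request(prompt: str, effort: str = "xhigh") -> dict:
    """Build request parameters for a deep-planning call."""
    if effort not in {"low", "medium", "high", "xhigh"}:
        raise ValueError(f"unsupported reasoning effort: {effort}")
    return {
        "model": "gpt-5.4",               # one model family...
        "reasoning": {"effort": effort},  # ...with effort as a dial on it
        "input": prompt,
    }

# The resulting parameters would then be handed to a client, e.g.:
#   client.responses.create(**planning_request("Draft plan.md from spec.md"))
```

Note what is deliberately absent: there is no string like "gpt-5.4-xhigh" anywhere in the request.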

5. Anthropic gives the strongest evidence for a planner/executor split

Anthropic’s current docs gave me the most interesting confirmation.

Their public Claude 4.6 docs describe:

  • Claude Opus 4.6 as their most intelligent model for building agents and coding
  • Claude Sonnet 4.6 as the best combination of speed and intelligence

But Claude Code’s own model configuration docs go further and document opusplan:

  • use Opus during plan mode
  • switch to Sonnet for execution

That is a real productized version of the same idea I was exploring manually.

Anthropic’s Claude Code GitHub Actions docs reinforce the same pattern in a simpler way:

  • GitHub Actions default to Sonnet
  • you explicitly opt into Opus 4.6 when you want the heavier model

So if you wanted one sentence of “proof” that multi-model phase routing is not a weird personal habit, this is probably the strongest one:

Anthropic already ships a planning/execution split as a named mode.
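The split that opusplan names is small enough to express in a few lines. This sketch mirrors the documented idea (the stronger model during plan mode, the faster model for execution); the model labels are this article's display names, not exact API model IDs.

```python
# A sketch of the planner/executor split that Claude Code names "opusplan":
# the stronger model handles plan mode, the faster model handles execution.
# Model labels mirror this article; they are not exact API model IDs.
PLANNING_PHASES = {"plan"}

def opusplan_style_model(phase: str) -> str:
    """Opus during plan mode, Sonnet for everything downstream."""
    return "Claude Opus 4.6" if phase in PLANNING_PHASES else "Claude Sonnet 4.6"

# Walking a feature through the split:
for phase in ("plan", "implement", "repair"):
    print(phase, "->", opusplan_style_model(phase))
```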

The Operational Tax of Doing This

This is the part that can get lost if the article only talks about model quality.

Routing phases across providers is not free. It adds:

  • more API-key and provider setup
  • more billing surfaces to monitor
  • more context handoff between phases
  • more workflow discipline so engineers know when to switch models
  • more debugging overhead when results are bad and the team has to ask whether the issue was the prompt, the phase, the model, or the handoff

That means I would only recommend a cross-provider stack like this when at least one of these is true:

  • planning quality is expensive to get wrong
  • the team already uses multiple providers
  • early-phase speed matters enough to justify the extra setup

If none of that is true, a simpler one-provider workflow will often beat a clever multi-model workflow in real life.

My Updated Recommendation

If I were standardizing this for a team today, I would update the original hypothesis to this:

Stage 1: Constitution, Specify, Clarify

Default:

  • Gemini 3 Flash Preview

Why:

  • fast enough to keep iteration tight
  • positioned by Google as a newer 3-series Flash model with Pro-level intelligence at Flash speed/pricing
  • good fit for turning repo evidence and product intent into clean first-draft artifacts

When I would go lighter:

  • use a lighter Flash-class option such as Gemini 2.5 Flash-Lite when cost and speed matter more than nuance

When I would go heavier:

  • use Gemini 3.1 Pro Preview if you want Google’s newer Pro-class option for harder clarification or repo synthesis passes
  • use GPT-5.4 mini or Claude Sonnet 4.6 if you want to stay closer to the provider you already trust operationally

Stage 2: Plan, Tasks, Analyze

Default:

  • GPT-5.4 with high or xhigh reasoning effort

Escalation:

  • GPT-5.4 pro when the architecture is unusually expensive to get wrong and extra latency is acceptable

Why:

  • these are the phases where deeper reasoning usually pays for itself
  • this is where artifact consistency matters most
  • this is where weak reasoning shows up later as rework, missing dependencies, or broken acceptance logic

One subtle but important correction:

  • I would describe this setup as GPT-5.4 at xhigh effort
  • I would not describe it as a model called gpt-5.4-xhigh

Stage 3: Implement

Default:

  • Claude Sonnet 4.6

Escalation:

  • Claude Opus 4.6 when the change is especially risky, architecture-heavy, or likely to require more autonomous repair loops

Why I changed my mind here:

  • my original instinct was “use Opus for implementation because it is the strongest coding model”
  • but Anthropic’s own Claude Code docs suggest a more nuanced operating model:
    • Opus for planning
    • Sonnet for execution
  • Anthropic’s GitHub Actions docs also default automated execution to Sonnet unless you explicitly switch to Opus 4.6

That does not mean Opus is a bad implementation model.

It means I would not treat Opus as the universal default executor anymore. I would treat it as the premium escalation path for harder codegen and harder repair loops.
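That escalation rule can be made explicit rather than left to gut feel. The following is a hedged heuristic sketch: the risk-signal names and the repair-loop threshold are illustrative choices of mine, not measured values from any vendor doc.

```python
# A hedged heuristic for the escalation rule above: default the implement
# phase to Sonnet, escalate to Opus only on risk signals. The signal names
# and the repair-loop threshold are illustrative, not measured.
RISK_SIGNALS = {"cross_cutting", "schema_migration", "auth_change"}

def implement_model(signals: set, repair_loops: int = 0) -> str:
    """Pick the implementation model from change-risk signals."""
    risky = bool(signals & RISK_SIGNALS) or repair_loops >= 2
    return "Claude Opus 4.6" if risky else "Claude Sonnet 4.6"
```

The useful property is that escalation becomes a reviewable decision: if a change keeps bouncing through repair loops on Sonnet, the rule itself says when to pay for Opus.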

How I Would Run This in Practice

Inside Cursor, I would make the routing visible instead of keeping it in my head.

I would create three modes: Spec Draft, Deep Plan, and Implement. Before detailing each one, here is a concrete example of the routing in action.

Imagine I am working on a brownfield SaaS repo and want to document, then extend, an existing invoice export flow:

  1. I would use Gemini 3 Flash Preview for constitution, specify, and clarify so I can quickly turn repo evidence and product notes into clean artifacts.
  2. I would switch to GPT-5.4 at high or xhigh for plan, tasks, and analyze so the architecture, dependency ordering, and consistency checks are stronger.
  3. I would hand implement to Claude Sonnet 4.6 for the default coding pass and only escalate to Claude Opus 4.6 if the implementation starts touching riskier cross-cutting areas than expected.

That is the kind of workflow I mean throughout this article: not model-switching for its own sake, but deliberate switching because the work itself changes shape.

1. Spec Draft

Use for:

  • /speckit.constitution
  • /speckit.specify
  • /speckit.clarify

Instruction focus:

  • stay grounded in repo evidence
  • optimize for clarity and flow
  • do not over-engineer

2. Deep Plan

Use for:

  • /speckit.plan
  • /speckit.tasks
  • /speckit.analyze

Instruction focus:

  • optimize for internal consistency
  • preserve requirements exactly
  • expose risks and dependencies explicitly

3. Implement

Use for:

  • /speckit.implement
  • repair loops after tests or runtime failures

Instruction focus:

  • make minimal, verifiable code changes
  • keep implementation aligned to spec.md, plan.md, and tasks.md
  • prefer passing tests and small diffs over cleverness

The important part is not the labels. The important part is that engineers can see:

  • which phase they are in
  • which model is expected there
  • why that model was chosen

That lowers drift a lot.

Alternatives I Would Seriously Consider

The goal is not to force one exact stack. The goal is to choose a stack that matches your team’s priorities.

If I had to reduce the decision down to one quick rubric, it would be:

If your priority is… | Start with…
lowest workflow complexity | Alternative 3: Maximum simplicity
easiest vendor consistency | Alternative 1: Anthropic-first
strongest planning depth | Alternative 2: OpenAI-heavy
latest Google Pro-class option | Alternative 5: Google-first updated
fastest early-phase drafting with deeper planning later | the default split in this article

Alternative 1: Anthropic-first and simpler

  • constitution, specify, clarify with Claude Sonnet 4.6
  • plan with Claude Opus 4.6
  • implement with Claude Sonnet 4.6

Best when:

  • you want less provider switching
  • you already use Claude Code heavily
  • you want something close to Anthropic’s documented opusplan philosophy

Alternative 2: OpenAI-heavy and planning-centric

  • early phases with GPT-5.4 mini
  • planning phases with GPT-5.4
  • hardest planning cases with GPT-5.4 pro

Best when:

  • you want to stay closer to one provider
  • your biggest pain is planning quality, not raw implementation speed
  • you are comfortable paying more latency for better artifact quality

Alternative 3: Maximum simplicity

  • one strong general model for every phase

Best when:

  • your team will not actually maintain a multi-model workflow
  • operational simplicity matters more than squeezing out phase-by-phase gains

This is the alternative I would choose for most teams before I choose a clever but brittle orchestration.

Alternative 4: OpenAI coding alternative outside the Cursor default

If you are building a broader coding workflow beyond just Cursor model selection, OpenAI’s current docs also position GPT-5.3-Codex as their most capable agentic coding model.

That makes it a real alternative for the execution side if your stack leans OpenAI more than Anthropic.

I would still separate:

  • drafting
  • planning
  • implementation

But the implementation slot does not have to belong to Claude if your tooling and costs point elsewhere.

Alternative 5: Google-first updated

  • constitution, specify, clarify with Gemini 3 Flash Preview
  • harder clarification or planning passes with Gemini 3.1 Pro Preview
  • implement with your preferred executor, usually Claude Sonnet 4.6 or another coding-first model

Best when:

  • you want to stay closer to Google’s current model family
  • you specifically want Google’s current Pro-class model in the stack
  • you like the default article logic of fast early drafting, but want a newer Google Pro-class escalation path

What Changed My Mind Most

Three things:

1. The hypothesis was right at the workflow level

The docs support the larger pattern:

  • fast model early
  • deeper model for planning
  • coding model for execution

That part is stronger after the review, not weaker.

2. The exact Google role mattered more than I expected

This was the sharpest clarification.

The useful Google split is not “one Gemini model everywhere.” It is:

  • Gemini 3 Flash Preview for speed-sensitive drafting
  • Gemini 3.1 Pro Preview for heavier Google-native passes

That is exactly why I now like dating these articles explicitly.

3. Anthropic’s opusplan is the cleanest alternative to manual routing

I expected to end up with a more custom conclusion.

Instead, Anthropic’s docs point to something simpler:

  • use the stronger model for planning
  • use the faster still-strong model for execution

That is a very good default pattern even if you never copy my exact stack.

My Final Take

If you want the shortest possible answer, it is this:

  • the idea behind the original hypothesis is correct
  • the exact model names should be updated for 2026

So the version I would actually hand to an engineering team today is:

  1. Gemini 3 Flash Preview for constitution, specify, and clarify
  2. GPT-5.4 at high or xhigh for plan, tasks, and analyze
  3. Claude Sonnet 4.6 for default implement, with Claude Opus 4.6 reserved for harder implementation cases

If the team wants the newest Google Pro-class option in the stack, I would use Gemini 3.1 Pro Preview alongside Gemini 3 Flash Preview rather than collapsing the whole Google side into one model.

And if the team wants something simpler and more vendor-consistent, I would look hard at an Anthropic-first setup built around the same principle Claude Code already documents in opusplan.

That is probably the most important lesson from this research:

the winning move is not one magical model. It is matching the model to the phase.

Source List