
A Phase-by-Phase Model Strategy for GitHub Spec Kit in 2026

A source-backed follow-up on routing GitHub Spec Kit phases across fast drafting models, deeper planning models, and stronger implementation models without treating all steps as the same kind of work.

19 min read Updated Apr 14, 2026

My last Spec Kit article answered a brownfield question:

How do I reverse-engineer one existing feature into usable Spec Kit artifacts?

The next question is the one I kept hearing right after that:

Should every Spec Kit phase use the same model?

My working hypothesis was:

  • use a fast model for constitution, specify, and clarify
  • use a stronger reasoning model for plan, tasks, and analyze
  • use a stronger coding model for implement

That part still holds up.

But when I rechecked the current vendor docs on April 14, 2026, the exact stack I had in mind needed clearer naming and role separation:

  • Google’s current lineup makes the split clearer: Gemini 3 Flash Preview for the newer fast slot and Gemini 3.1 Pro Preview for Google’s heavier Pro-class work
  • OpenAI’s current public naming is GPT-5.4 or GPT-5.4 pro with a reasoning effort setting like xhigh, not a model literally called gpt-5.4-xhigh
  • Anthropic’s current docs describe Claude Opus 4.6 and Claude Sonnet 4.6, and Claude Code documents an opusplan mode that uses Opus for planning and Sonnet for execution

So this article is not “I proved my exact original hypothesis was perfectly right.”

It is a more useful follow-up:

the phase-splitting idea is strong, but the exact model mix should be updated for 2026.

TL;DR

  • Yes, I still think phase-specific model routing is the right mental model for GitHub Spec Kit.
  • No, I no longer think the most useful framing is just “fast model here, smart model there, coding model there.”
  • The more useful question is: what breaks first when you actually run this workflow?
  • The common complaints are surprisingly consistent:
    • Gemini can feel too creative unless you ground it hard
    • GPT-5.4 can become too slow if you let it reason deeply on every loop
    • Claude Opus is powerful, but often too slow to be the default executor
  • The most interesting new variable is Composer 2 Fast inside Cursor:
    • Cursor says Composer 2 has frontier-level coding performance
    • Cursor says the faster variant has the same intelligence, lower cost than other fast models, and is now the default fast option
  • My current practical default stack is:
    1. constitution, specify, clarify with Composer 2 Fast inside Cursor
    2. plan, tasks, analyze with GPT-5.4 at medium or high, not xhigh by default
    3. implement with Claude Sonnet 4.6 by default, escalating to Claude Opus 4.6 only when the code path is unusually hard
  • I still think Gemini 3 Flash Preview and Gemini 3.1 Pro Preview are useful, but I now see them as grounded Google paths, not my default answer for every fast loop

Who This Is For

This article is for teams that:

  • already use, or want to use, GitHub Spec Kit as a real workflow instead of a one-off prompt
  • are comfortable working in Cursor or another model-switching environment
  • care enough about planning quality to tolerate some operational complexity
  • want a stronger answer than “just use the same model for everything”

This article is probably not for teams that:

  • want the fewest possible moving parts
  • do not want to manage multiple providers, billing paths, or model-specific behavior
  • are still early enough in adoption that one strong default model would be easier to socialize

If that is your situation, skip ahead to Alternative 4: Maximum simplicity.

What Keeps Breaking in Practice

This is the section I think the earlier version underplayed.

Theoretical strengths matter, but recurring failure modes matter more.

1. Gemini can drift unless you ground it hard

This is the complaint I hear most often about Gemini in spec-first workflows.

Not “Gemini is useless.”

More like:

  • the draft is fast
  • the prose sounds good
  • but some of it is too confident relative to the repo evidence

Google’s own Gemini API docs are a useful reminder here. Their safety guidance explicitly says large language models can generate text that is factually incorrect. Their grounding docs say grounding with Google Search can reduce hallucinations and improve factual accuracy. Their structured outputs docs also warn that even schema-valid output may still fail your business logic checks.

So my updated read is:

  • Gemini is still attractive for speed
  • but it is safest when you force grounding, evidence-driven prompts, and post-generation validation

That means I no longer love Gemini as an unguarded default for final artifact drafting.
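One way to make “post-generation validation” concrete is to check a draft's file references against the repository before accepting it. The sketch below is a minimal illustration, not part of Spec Kit or the Gemini API: the function name `unverified_paths` and the convention that drafts cite repo files in backticks are both assumptions made for this example.

```python
import re
from pathlib import Path

# Hypothetical convention: the draft cites repo files as `path/to/file.ext`.
PATH_PATTERN = re.compile(r"`([\w./-]+\.\w+)`")


def unverified_paths(draft: str, repo_root: Path) -> list[str]:
    """Return backtick-cited paths that do not exist as files under repo_root."""
    cited = sorted(set(PATH_PATTERN.findall(draft)))
    return [p for p in cited if not (repo_root / p).is_file()]
```

A non-empty result does not prove the draft is wrong (a backticked token such as a version number can also match the pattern), but it gives a cheap list of claims to re-check against repo evidence before the artifact is treated as final.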

2. GPT-5.4 can reason too much for inner loops

I still like GPT-5.4 a lot for planning.

But the common complaint is also real:

  • if you use GPT-5.4 on every draft loop
  • or if you jump straight to high/xhigh too often
  • the workflow can become noticeably slower than it needs to be

OpenAI’s current docs support that tradeoff. They position GPT-5.4 as the flagship for complex reasoning and coding, while also explicitly recommending GPT-5.4 mini and GPT-5.4 nano for lower-latency, lower-cost workloads. They also note that GPT-5.4 pro may take several minutes on harder requests.

So the issue is not “GPT-5.4 is bad.”

The issue is:

  • it is excellent for the checkpoint
  • it is often overkill for the inner loop

3. Claude Opus is too slow to be the default executor

Anthropic’s docs line up almost perfectly with the complaint here.

They describe:

  • Claude Opus 4.6 as the most intelligent model for agents and coding
  • Claude Sonnet 4.6 as the best combination of speed and intelligence

Claude Code also documents:

  • opusplan, which uses Opus for planning and Sonnet for execution
  • fast mode for Opus 4.6

That is a strong signal.

If Anthropic itself ships:

  • Opus for the hard reasoning moments
  • Sonnet for daily execution

then I do not think Opus should be the default implementation model in this article anymore.

4. Composer 2 Fast belongs in this conversation now

This is the biggest addition to the article.

Cursor’s current Composer 2 materials say:

  • Composer 2 has frontier-level coding performance
  • Composer 2 scored 61.3 on CursorBench, 61.7 on Terminal-Bench 2.0, and 73.7 on SWE-bench Multilingual
  • the faster variant has the same intelligence
  • it has lower cost than other fast models
  • Cursor is making the fast variant the default option

That does not automatically make Composer 2 Fast the best model for every Spec Kit phase.

But it absolutely makes it worth analyzing as:

  • the fastest coding-native loop inside Cursor
  • a strong default for iterative drafting
  • a serious alternative when GPT-5.4 and Opus feel too slow

The Recommendation in One View

If you only want the answer before the evidence, this is the version I would hand to an engineering team today:

Spec Kit phase | My practical default | Why | Escalate to
constitution | Composer 2 Fast | Fast Cursor-native repo synthesis | Gemini 3.1 Pro Preview or GPT-5.4 if the repo is unusually subtle
specify | Composer 2 Fast | Fast drafting with strong coding context | Gemini 3 Flash Preview with grounding if you want Google
clarify | Composer 2 Fast | Quick iteration without GPT-5.4 latency | Gemini 3.1 Pro Preview for harder clarification passes
plan | GPT-5.4 (medium or high) | Better architecture reasoning | xhigh or GPT-5.4 pro only when the plan is high-stakes or stuck
tasks | GPT-5.4 (medium or high) | Better decomposition and dependency quality | xhigh if the breakdown is still weak
analyze | GPT-5.4 (medium or high) | Better cross-artifact consistency checking | GPT-5.4 pro for rare deep review passes
implement | Claude Sonnet 4.6 | Best daily speed/intelligence balance | Claude Opus 4.6 for hard rescues, Composer 2 Fast for quick local loops
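If a team wants this table somewhere more enforceable than a wiki page, it can live as a small lookup structure. The sketch below is one possible encoding in Python. The `Route` class and `pick_model` helper are hypothetical names, and the model strings are display labels copied from the table, not verified API model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    default: str
    escalate_to: tuple[str, ...]


# Labels mirror the recommendation table; they are display names, not API model IDs.
ROUTES = {
    "constitution": Route("Composer 2 Fast", ("Gemini 3.1 Pro Preview", "GPT-5.4")),
    "specify": Route("Composer 2 Fast", ("Gemini 3 Flash Preview (grounded)",)),
    "clarify": Route("Composer 2 Fast", ("Gemini 3.1 Pro Preview",)),
    "plan": Route("GPT-5.4 (medium or high)", ("GPT-5.4 (xhigh)", "GPT-5.4 pro")),
    "tasks": Route("GPT-5.4 (medium or high)", ("GPT-5.4 (xhigh)",)),
    "analyze": Route("GPT-5.4 (medium or high)", ("GPT-5.4 pro",)),
    "implement": Route("Claude Sonnet 4.6", ("Claude Opus 4.6", "Composer 2 Fast")),
}


def pick_model(phase: str, escalated: bool = False) -> str:
    """Return the default label for a phase, or its first escalation option."""
    route = ROUTES[phase]
    if escalated and route.escalate_to:
        return route.escalate_to[0]
    return route.default
```

For example, `pick_model("plan")` returns the default planning label, while `pick_model("implement", escalated=True)` returns the first escalation option for implementation.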

What I Can Actually Prove Here

I want to be careful with the word “proof.”

What I can prove from the current docs is:

  • Spec Kit phases are distinct enough that routing them differently is operationally reasonable
  • Google itself documents the risk of factual inaccuracy, along with grounding and structured-output mitigations
  • OpenAI itself distinguishes flagship deep reasoning from lower-latency smaller variants
  • Anthropic itself distinguishes Opus from Sonnet by speed/intelligence role
  • Claude Code itself productizes a plan/execute split
  • Cursor now has a first-party fast coding model path worth taking seriously

What I did not do here is run a controlled benchmark with identical prompts, identical repositories, and scored outputs across all providers.

So treat this article as source-backed workflow research plus practical engineering analysis, not as a lab-grade benchmark paper.

Why the Naive Phase Map Stops Being Useful

The old version of the article mostly asked:

  • which model is strongest for this phase?

The more useful version asks:

  • what failure mode hurts this phase most?

For example:

  • constitution, specify, clarify suffer most from slow iteration and overconfident drafting
  • plan, tasks, analyze suffer most from shallow reasoning and inconsistency across artifacts
  • implement suffers most from latency during repair loops

Once you see the workflow that way, the stack changes.

What the Official Docs Support

1. Spec Kit still gives us the right phase boundaries

The current Spec Kit README documents the core flow as:

  • /speckit.constitution
  • /speckit.specify
  • /speckit.plan
  • /speckit.tasks
  • /speckit.implement

It also documents optional commands like:

  • /speckit.clarify
  • /speckit.analyze
  • /speckit.checklist

That still matters. The phases are real. The question is how aggressively you want to optimize each one.

2. Google docs support the Gemini concern and the Gemini mitigation

Google’s own Gemini API docs do not say “Gemini hallucinates too much.”

But they do say enough to make the concern real:

  • Gemini output can be factually incorrect
  • grounding with Google Search can reduce hallucinations
  • structured output improves formatting but does not guarantee business correctness

That leads to a more careful Google recommendation:

  • use Gemini 3 Flash Preview when you want the fast Google path
  • use Gemini 3.1 Pro Preview when you want Google’s heavier engineering-oriented path
  • but treat Gemini as much safer when the prompt is tightly grounded in repo evidence, structured output, or search grounding

3. OpenAI docs support using GPT-5.4 as a checkpoint model, not an everywhere model

OpenAI’s current docs say:

  • GPT-5.4 is the flagship for complex reasoning and coding
  • GPT-5.4 mini is a faster, more efficient model for high-volume workloads
  • GPT-5.4 pro can take several minutes on hard requests

That maps well to:

  • use GPT-5.4 for the serious planning checkpoints
  • do not assume GPT-5.4 at xhigh belongs in every loop

This is the change to the article I feel best about.

The issue was never that GPT-5.4 was weak.

The issue was that I was implicitly treating a checkpoint model like an iteration model.

4. Anthropic docs support Sonnet as the default executor much more than Opus

Anthropic’s current docs say:

  • Claude Sonnet 4.6 is the best combination of speed and intelligence
  • Claude Opus 4.6 is the most intelligent model for agents and coding
  • fast mode makes Opus 4.6 up to 2.5x faster

Claude Code adds two practical workflow signals:

  • opusplan = Opus for planning, Sonnet for execution
  • model aliases explicitly position Sonnet for daily coding tasks

So if the complaint is “Claude Opus is simply too slow,” the docs are basically pointing to the same answer:

  • yes, keep Opus
  • no, do not make it the default execution loop

5. Cursor’s Composer 2 Fast is the most important new addition

The biggest thing missing from the earlier version of this article was Cursor’s own fast-model path.

Cursor now says:

  • Composer 2 is frontier-level at coding
  • Composer 2 has strong results on CursorBench and other coding benchmarks
  • the faster variant has the same intelligence
  • the fast variant has lower cost than other fast models
  • Cursor is making the fast variant the default

Because this workflow already assumes Cursor for a lot of Spec Kit work, Composer 2 Fast deserves a first-class role in the article.

It is not automatically better than GPT-5.4 on deep planning.

But it is extremely relevant for:

  • fast constitution/spec drafting
  • quick clarification loops
  • quick code-edit loops when Sonnet or Opus feel too slow

The Operational Tax of Doing This

This is still the part that can get lost if the article only talks about model quality.

Routing phases across providers is not free. It adds:

  • more API-key and provider setup
  • more billing surfaces to monitor
  • more context handoff between phases
  • more workflow discipline so engineers know when to switch models
  • more debugging overhead when results are bad and the team has to ask whether the issue was the prompt, the phase, the model, or the handoff

That means I would only recommend a cross-provider stack like this when at least one of these is true:

  • planning quality is expensive to get wrong
  • the team already uses multiple providers
  • early-phase speed matters enough to justify the extra setup

If none of those is true, a simpler one-provider workflow will often beat a clever multi-model workflow in real life.
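The debugging overhead in particular gets easier to manage if every artifact records which phase, model, and handoff produced it. The following is a minimal sketch of such a record, assuming an append-only JSON-lines log. The `PhaseRun` fields and the `prompt_version` tag are illustrative choices, not something Spec Kit or any provider defines.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PhaseRun:
    phase: str           # e.g. "plan"
    model: str           # label of the model that produced the artifact
    inputs: list[str]    # artifacts handed in from earlier phases
    output: str          # artifact this run produced
    prompt_version: str  # hypothetical tag for the prompt template used


def to_log_line(run: PhaseRun) -> str:
    """Serialize one run as a single JSON line for an append-only log."""
    return json.dumps(asdict(run), sort_keys=True)
```

With a line like this per run, the question “was it the prompt, the phase, the model, or the handoff?” can at least start from recorded facts rather than memory.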

My Updated Recommendation

If I were standardizing this for a team today, I would update the original hypothesis to this:

Stage 1: Constitution, Specify, Clarify

Default:

  • Composer 2 Fast

Why:

  • it is now the most compelling fast loop available directly inside Cursor
  • Cursor positions it as a frontier coding model with a faster default variant
  • fast drafting matters a lot more in these early phases than maximum reasoning depth

When I would use Google instead:

  • use Gemini 3 Flash Preview when you want Google’s fast path
  • use Gemini 3.1 Pro Preview when you want Google’s heavier engineering-oriented path
  • in both cases, add stronger grounding and validation than I used to

Stage 2: Plan, Tasks, Analyze

Default:

  • GPT-5.4 with medium or high

Escalation:

  • use xhigh only when the plan is still weak after a normal pass
  • use GPT-5.4 pro only when the architecture is unusually expensive to get wrong and extra latency is acceptable

Why:

  • these are the phases where deeper reasoning usually pays for itself
  • this is where artifact consistency matters most
  • this is where weak reasoning shows up later as rework, missing dependencies, or broken acceptance logic

One subtle but important correction:

  • I would describe this setup as GPT-5.4 at medium/high by default
  • I would not jump to xhigh unless the task actually needs it
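The escalation rule above amounts to a small ladder: stay where you are when the pass was good enough, and move up one rung only when it was not. Here is a hedged sketch of that policy in Python. The effort labels follow this article's wording and should be treated as assumptions, so check your provider's docs for the values it actually accepts. `next_effort` is a hypothetical helper name.

```python
# Effort labels follow the article's wording; they are assumptions,
# not a verified list of values any API accepts.
EFFORT_LADDER = ("medium", "high", "xhigh")


def next_effort(current: str, plan_still_weak: bool) -> str:
    """Return the effort for the next pass: unchanged unless the plan is still weak."""
    if not plan_still_weak:
        return current
    i = EFFORT_LADDER.index(current)
    # Move up one rung, but never past the top of the ladder.
    return EFFORT_LADDER[min(i + 1, len(EFFORT_LADDER) - 1)]
```

The point of the design is that xhigh is only reachable after a weaker pass has already been tried and judged inadequate. It is never the starting point.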

Stage 3: Implement

Default:

  • Claude Sonnet 4.6

Secondary fast loop:

  • Composer 2 Fast when I want a quick local iteration loop inside Cursor

Escalation:

  • Claude Opus 4.6 when the change is especially risky, architecture-heavy, or likely to require more autonomous repair loops

Why:

  • Sonnet is now the clearest “daily driver” implementation model in the official docs
  • Opus is valuable, but too slow to be my default execution loop
  • Composer 2 Fast is now good enough that it belongs in the practical implementation conversation too

How I Would Run This in Practice

Inside Cursor, I would stop thinking in terms of “one model per phase” and start thinking in terms of inner loops and checkpoint loops.

Here is one concrete example.

Imagine I am working on a brownfield SaaS repo and want to document, then extend, an existing invoice export flow:

  1. I would use Composer 2 Fast for constitution, specify, and clarify to iterate quickly and keep the drafting loop tight.
  2. If the draft starts to feel too loose or makes too many assumptions about the code, I would try Gemini 3 Flash Preview or Gemini 3.1 Pro Preview, but only with stronger grounding and evidence constraints.
  3. I would switch to GPT-5.4 at medium or high for plan, tasks, and analyze so the architecture and cross-artifact consistency checks are stronger.
  4. I would hand implement to Claude Sonnet 4.6 for the default coding pass.
  5. I would escalate only when needed:
    • GPT-5.4 xhigh/pro for hard planning deadlocks
    • Claude Opus 4.6 for hard implementation rescues
    • Composer 2 Fast when I want a quick local coding loop without paying the Opus latency tax

That is the kind of workflow I mean throughout this article: not model-switching for its own sake, but deliberate switching because the failure mode changes.
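The invoice-export walkthrough can be sketched as a loop that runs each phase on its default and escalates once only when the result fails review. This is an illustration under stated assumptions: `run_phase` is a hypothetical callback standing in for whatever actually invokes the model and judges the artifact, the model strings are labels rather than API identifiers, and the optional Gemini detour from step 2 is left out for brevity.

```python
from typing import Callable, Optional

# (phase, default label, escalation label or None), following the numbered steps.
PIPELINE: list[tuple[str, str, Optional[str]]] = [
    ("constitution", "Composer 2 Fast", None),
    ("specify", "Composer 2 Fast", None),
    ("clarify", "Composer 2 Fast", None),
    ("plan", "GPT-5.4 medium/high", "GPT-5.4 xhigh/pro"),
    ("tasks", "GPT-5.4 medium/high", "GPT-5.4 xhigh/pro"),
    ("analyze", "GPT-5.4 medium/high", "GPT-5.4 xhigh/pro"),
    ("implement", "Claude Sonnet 4.6", "Claude Opus 4.6"),
]


def run_pipeline(run_phase: Callable[[str, str], bool]) -> list[tuple[str, str]]:
    """Run each phase on its default model; retry once on the escalation model if it fails.

    run_phase(phase, model) is a hypothetical callback that returns True
    when the produced artifact passes review.
    """
    used = []
    for phase, default, fallback in PIPELINE:
        model = default
        if not run_phase(phase, model) and fallback is not None:
            model = fallback
            run_phase(phase, model)
        used.append((phase, model))
    return used
```

The returned list shows which model ended up handling each phase, which makes it easy to see how often escalation is actually needed.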

1. Fast Draft

Use for:

  • /speckit.constitution
  • /speckit.specify
  • /speckit.clarify

Default:

  • Composer 2 Fast

Instruction focus:

  • stay grounded in repo evidence
  • optimize for clarity and flow
  • do not over-engineer

2. Deep Checkpoint

Use for:

  • /speckit.plan
  • /speckit.tasks
  • /speckit.analyze

Default:

  • GPT-5.4 medium/high

Instruction focus:

  • optimize for internal consistency
  • preserve requirements exactly
  • expose risks and dependencies explicitly

3. Daily Execution

Use for:

  • /speckit.implement
  • repair loops after tests or runtime failures

Default:

  • Claude Sonnet 4.6

Instruction focus:

  • make minimal, verifiable code changes
  • keep implementation aligned to spec.md, plan.md, and tasks.md
  • prefer passing tests and small diffs over cleverness
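These three profiles can also be written down as data, so the instruction focus travels with the command instead of living in someone's head. The sketch below is one hypothetical encoding: `Profile`, `profile_for`, and `preamble` are names invented for this example, and the preamble format is an assumption rather than anything Cursor or Spec Kit prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    commands: tuple[str, ...]
    default_model: str  # display label, not an API model ID
    focus: tuple[str, ...]


PROFILES = (
    Profile(
        "Fast Draft",
        ("/speckit.constitution", "/speckit.specify", "/speckit.clarify"),
        "Composer 2 Fast",
        ("stay grounded in repo evidence", "optimize for clarity and flow", "do not over-engineer"),
    ),
    Profile(
        "Deep Checkpoint",
        ("/speckit.plan", "/speckit.tasks", "/speckit.analyze"),
        "GPT-5.4 medium/high",
        ("optimize for internal consistency", "preserve requirements exactly",
         "expose risks and dependencies explicitly"),
    ),
    Profile(
        "Daily Execution",
        ("/speckit.implement",),  # repair loops are not a command, so they are not listed here
        "Claude Sonnet 4.6",
        ("make minimal, verifiable code changes",
         "keep implementation aligned to spec.md, plan.md, and tasks.md",
         "prefer passing tests and small diffs over cleverness"),
    ),
)


def profile_for(command: str) -> Profile:
    """Find the profile that owns a Spec Kit command."""
    for profile in PROFILES:
        if command in profile.commands:
            return profile
    raise KeyError(command)


def preamble(command: str) -> str:
    """Build a short instruction preamble for the given command."""
    p = profile_for(command)
    lines = [f"Profile: {p.name} (default model: {p.default_model})"]
    lines += [f"- {item}" for item in p.focus]
    return "\n".join(lines)
```

Calling `preamble("/speckit.plan")` produces a header line naming the Deep Checkpoint profile followed by its three focus bullets, which can be pasted ahead of the command's own prompt.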

Alternatives I Would Seriously Consider

The goal is not to force one exact stack. The goal is to choose a stack that matches your team’s priorities.

If I had to reduce the decision down to one quick rubric, it would be:

If your priority is… | Start with…
fastest Cursor-native loop | Alternative 1: Cursor-native speed path
strongest planning depth | Alternative 2: OpenAI checkpoint path
Google-first path with stronger grounding | Alternative 3: Grounded Gemini path
lowest workflow complexity | Alternative 4: Maximum simplicity

Alternative 1: Cursor-native speed path

  • constitution, specify, clarify with Composer 2 Fast
  • plan, tasks, analyze with GPT-5.4 medium/high
  • implement with Claude Sonnet 4.6

Best when:

  • you live in Cursor already
  • your biggest pain is slow iteration
  • you want the tightest practical loop without defaulting to GPT-5.4 or Opus everywhere

Alternative 2: OpenAI checkpoint path

  • fast draft with Composer 2 Fast
  • planning checkpoints with GPT-5.4
  • hardest planning cases with GPT-5.4 pro
  • implementation with Claude Sonnet 4.6

Best when:

  • your biggest pain is planning quality, not raw implementation speed
  • you are comfortable paying latency only at checkpoint moments

Alternative 3: Grounded Gemini path

  • constitution, specify, clarify with Gemini 3 Flash Preview
  • heavier clarification or synthesis passes with Gemini 3.1 Pro Preview
  • planning checkpoints with GPT-5.4
  • implementation with Claude Sonnet 4.6

Best when:

  • you want to stay closer to Google’s current model family
  • you are willing to add grounding and validation guardrails
  • you still value the speed of Gemini for drafting

Alternative 4: Maximum simplicity

  • one strong general model for every phase

Best when:

  • your team will not actually maintain a multi-model workflow
  • operational simplicity matters more than squeezing out phase-by-phase gains

This is still the alternative I would choose for most teams before I choose a clever but brittle orchestration.

What Changed My Mind Most

Three things:

1. The right question is now failure-mode routing, not just phase routing

The old article was directionally right, but too clean.

The practical question is not only:

  • which model is strongest for this phase?

It is also:

  • which model fails in the most painful way for this phase?

2. GPT-5.4 and Opus are better as checkpoints than as defaults

This was the sharpest workflow lesson.

  • GPT-5.4 is still excellent, but I want it for deep checkpoints, not every draft loop
  • Opus is still excellent, but I want it for hard rescues, not every execution loop

That is a much healthier operational model.

3. Composer 2 Fast now belongs in this article

This is the biggest change in the rewrite.

Cursor’s own materials now make Composer 2 Fast too relevant to ignore:

  • strong coding benchmarks
  • faster default variant
  • same intelligence claim for the fast option
  • lower cost than other fast models

For a Spec Kit workflow that already lives heavily inside Cursor, that changes the practical recommendation.

My Final Take

If you want the shortest possible answer, it is this:

  • the idea behind the original hypothesis is still correct
  • but the best stack in practice changes once you account for real failure modes

So the version I would actually hand to an engineering team today is:

  1. Composer 2 Fast for constitution, specify, and clarify
  2. GPT-5.4 at medium or high for plan, tasks, and analyze
  3. Claude Sonnet 4.6 for default implement
  4. Gemini 3 Flash Preview or Gemini 3.1 Pro Preview only when you want a grounded Google path
  5. GPT-5.4 xhigh/pro and Claude Opus 4.6 only as escalation tools

That is probably the most important lesson from this research:

the winning move is not one magical model. It is matching the model to the failure mode.

Source List