AI Coding Workflows

A Phase-by-Phase Model Strategy for GitHub Spec Kit in 2026

A source-backed follow-up on routing GitHub Spec Kit phases across fast drafting models, deeper planning models, and stronger implementation models without treating all steps as the same kind of work.

17 min read · Updated Apr 14, 2026

My last Spec Kit article answered a brownfield question:

How do I reverse-engineer one existing feature into usable Spec Kit artifacts?

The next question is the one I kept hearing right after that:

Should every Spec Kit phase use the same model?

My working hypothesis was:

  • use a fast model for constitution, specify, and clarify
  • use a stronger reasoning model for plan, tasks, and analyze
  • use a stronger coding model for implement

That part still holds up.

But when I rechecked the current vendor docs on April 14, 2026, the exact stack I had in mind needed clearer naming and role separation:

  • Google’s current lineup makes the split clearer: Gemini 3 Flash Preview for the newer fast slot and Gemini 3.1 Pro Preview for Google’s Pro-class, software-engineering-oriented work
  • OpenAI’s current public naming is GPT-5.4 or GPT-5.4 pro with a reasoning effort setting like xhigh, not a model literally called gpt-5.4-xhigh
  • Anthropic’s current docs describe Claude Opus 4.6 and Claude Sonnet 4.6, and Claude Code documents an opusplan mode that uses Opus for planning and Sonnet for execution

So this article is not “I proved my exact original hypothesis was perfectly right.”

It is a more useful follow-up:

the phase-splitting idea is strong, but the exact model mix should be updated for 2026.

TL;DR

  • Yes, I think phase-specific model routing is the right mental model for GitHub Spec Kit.
  • No, I would not keep the original stack exactly as written.
  • My current default stack is:
    1. constitution, specify, clarify with Gemini 3 Flash Preview
    2. plan, tasks, analyze with GPT-5.4 at high or xhigh effort
    3. implement with Claude Sonnet 4.6 by default, escalating to Claude Opus 4.6 when the implementation is unusually architecture-sensitive
  • If you want a cleaner Anthropic-native option, Claude Code’s documented opusplan split is one of the strongest alternatives:
    1. Opus for planning
    2. Sonnet for execution
  • The most important Google nuance is simple:
    • use Gemini 3 Flash Preview for the fast slot
    • use Gemini 3.1 Pro Preview when you want Google’s Pro-class option

Who This Is For

This article is for teams that:

  • already use, or want to use, GitHub Spec Kit as a real workflow instead of a one-off prompt
  • are comfortable working in Cursor or another model-switching environment
  • care enough about planning quality to tolerate some operational complexity
  • want a stronger answer than “just use the same model for everything”

This article is probably not for teams that:

  • want the fewest possible moving parts
  • do not want to manage multiple providers, billing paths, or model-specific behavior
  • are still early enough in adoption that one strong default model would be easier to socialize

If that is your situation, skip ahead to Alternative 3: Maximum simplicity.

The Recommendation in One View

If you only want the answer before the evidence, this is the version I would hand to an engineering team today:

Spec Kit phase | My 2026 default | Why
constitution | Gemini 3 Flash Preview | Fast repo-level synthesis
specify | Gemini 3 Flash Preview | Fast first draft of feature intent
clarify | Gemini 3 Flash Preview | Quick ambiguity reduction
plan | GPT-5.4 (high or xhigh) | Better architecture reasoning
tasks | GPT-5.4 (high or xhigh) | Better decomposition and dependency quality
analyze | GPT-5.4 (high or xhigh) | Better cross-artifact consistency checking
implement | Claude Sonnet 4.6 | Better speed-to-quality default for code execution
implement for hard cases | Claude Opus 4.6 | Premium option for harder coding and repair loops
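The table above can also live in code, so a team has one place to update when model names change. This is a sketch that mirrors the table exactly; the model strings are the article's display labels, not verified API model IDs.

```python
# Illustrative phase-to-model routing for the Spec Kit workflow above.
# Model names are display labels from this article, not exact API model IDs.
DEFAULT_ROUTING = {
    "constitution": "Gemini 3 Flash Preview",
    "specify": "Gemini 3 Flash Preview",
    "clarify": "Gemini 3 Flash Preview",
    "plan": "GPT-5.4 (high or xhigh)",
    "tasks": "GPT-5.4 (high or xhigh)",
    "analyze": "GPT-5.4 (high or xhigh)",
    "implement": "Claude Sonnet 4.6",
}

# Escalation path for unusually architecture-sensitive implementation work.
ESCALATIONS = {"implement": "Claude Opus 4.6"}

def model_for(phase: str, hard_case: bool = False) -> str:
    """Return the default model for a Spec Kit phase, honoring escalations."""
    if hard_case and phase in ESCALATIONS:
        return ESCALATIONS[phase]
    return DEFAULT_ROUTING[phase]
```

Keeping the routing as data rather than tribal knowledge is most of the point: when a vendor renames a model, the change is one diff instead of a team-wide announcement.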

What I Can Actually Prove Here

I want to be careful with the word “proof.”

What I can prove from the current docs is:

  • Spec Kit phases are distinct enough that routing them differently is operationally reasonable
  • Cursor supports model-aware modes, which makes per-phase routing practical inside one IDE workflow
  • Google positions Gemini 3 Flash as a fast 3-series model with Pro-level intelligence at Flash speed/pricing
  • Google positions Gemini 3.1 Pro Preview around software engineering and agentic workflows
  • OpenAI positions GPT-5.4 around deeper reasoning and complex professional work
  • Anthropic explicitly productized a planner/executor split in Claude Code with opusplan

What I did not do here is run a lab-grade benchmark with identical prompts, fixed repositories, and scored outputs across all models.

So treat this article as source-backed workflow research, not as a leaderboard benchmark paper.

Why a Single Model for Every Phase Is Usually the Wrong Default

GitHub Spec Kit phases do not ask the model to do the same kind of work.

/speckit.constitution, /speckit.specify, and /speckit.clarify are usually about:

  • extracting invariants
  • turning vague intent into structured requirements
  • spotting ambiguity quickly
  • producing clean artifacts without overthinking every line

/speckit.plan, /speckit.tasks, and /speckit.analyze are heavier:

  • architecture tradeoffs
  • dependency ordering
  • coverage gaps
  • internal consistency across multiple artifacts

/speckit.implement is different again:

  • code generation
  • file-level changes
  • test updates
  • repair loops
  • local tool interaction

That is why one-model-for-everything often feels either:

  • too slow and expensive early, or
  • too shallow later

The docs from the vendors line up with that intuition more than I expected.

What the Official Docs Support

1. Spec Kit already separates the workflow into phases

The current Spec Kit README documents the core flow as:

  • /speckit.constitution
  • /speckit.specify
  • /speckit.plan
  • /speckit.tasks
  • /speckit.implement

It also documents optional commands like:

  • /speckit.clarify
  • /speckit.analyze
  • /speckit.checklist

That matters because model routing only makes sense if the workflow is already phase-based. Spec Kit is.

2. Cursor makes phase routing operationally plausible

Cursor’s docs on modes and model selection are the missing operational layer here.

The current docs say:

  • custom modes can have their own model
  • custom modes can have their own instructions
  • once you find prompt and model combinations that work well, you can save them as Custom Modes

That is basically the infrastructure you need for:

  • a Spec Draft mode
  • a Deep Plan mode
  • an Implement mode

One important caveat from Cursor’s API key docs: provider availability depends on the exact API-key path and setup you use. So if your model-routing design depends on bring-your-own-provider behavior, verify your actual Cursor setup before standardizing on it team-wide.
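Cursor configures custom modes through its UI, so there is no canonical file format to show here. As a hedged sketch, though, a team could keep a checked-in record of the intended modes so everyone can see which model and instructions each mode is supposed to carry; the mode names and labels below are this article's, not anything Cursor defines.

```python
from dataclasses import dataclass

# Hypothetical, checked-in documentation of the Cursor custom modes this
# article proposes. Cursor itself is configured through its UI; this record
# just keeps the intended model and instructions visible to the team.
@dataclass(frozen=True)
class CursorMode:
    name: str
    model: str                 # display label, not an exact API model ID
    commands: tuple            # Spec Kit commands this mode covers
    instructions: str

MODES = {
    "spec-draft": CursorMode(
        name="Spec Draft",
        model="Gemini 3 Flash Preview",
        commands=("/speckit.constitution", "/speckit.specify", "/speckit.clarify"),
        instructions="Stay grounded in repo evidence; optimize for clarity.",
    ),
    "deep-plan": CursorMode(
        name="Deep Plan",
        model="GPT-5.4 (high/xhigh effort)",
        commands=("/speckit.plan", "/speckit.tasks", "/speckit.analyze"),
        instructions="Optimize for internal consistency; expose risks explicitly.",
    ),
    "implement": CursorMode(
        name="Implement",
        model="Claude Sonnet 4.6",
        commands=("/speckit.implement",),
        instructions="Minimal, verifiable changes; small diffs over cleverness.",
    ),
}
```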

3. Google’s current docs support both the fast Gemini 3 Flash path and the Pro-class Gemini 3.1 path

This was the part of the stack that needed the most naming clarity.

On April 14, 2026, Google’s official model catalog says:

  • Gemini 3 Flash is Google’s latest 3-series Flash model, with Pro-level intelligence at the speed and pricing of Flash
  • Gemini 3.1 Pro Preview provides better thinking, improved token efficiency, and a more grounded experience, and Google says it is optimized for software engineering and agentic workflows

Read together, those descriptions map onto the workflow like this:

  • Gemini 3 Flash Preview is the better fit when the early phases need fast drafting, quick ambiguity reduction, and a newer fast frontier model from Google
  • Gemini 3.1 Pro Preview is the better fit when you want Google’s heavier Pro-class option for harder clarification, repo synthesis, or other software-engineering-heavy passes

That is strong evidence for the broader workflow idea:

  • yes, use Google for fast early-phase drafting if you like that part of the stack
  • yes, if you specifically want Google’s Pro-class option, use Gemini 3.1 Pro Preview

4. OpenAI’s docs strongly support the “deep planning model” part

OpenAI’s current model docs say:

  • GPT-5.4 is the best intelligence at scale for agentic, coding, and professional workflows
  • GPT-5.4 pro is a more precise version that can think harder and may take several minutes on harder requests
  • xhigh is a supported reasoning effort, not a separate model family

That maps well to:

  • plan
  • tasks
  • analyze

These phases are not just about generating text. They are about holding more constraints in mind at once and producing internally consistent artifacts.

This is also the place where I would spend more latency budget on purpose.
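The "effort is a setting, not a model name" point is easy to encode. Here is a minimal sketch, assuming an OpenAI Responses-style API shape; the model ID "gpt-5.4" and the "xhigh" effort value come from this article's premise, not from verified identifiers.

```python
# A minimal sketch of "effort is a setting, not a model name", assuming an
# OpenAI Responses-style API. The model ID "gpt-5.4" and the "xhigh" effort
# value are taken from this article's premise, not verified identifiers.
def planning_request(prompt: str, effort: str = "xhigh") -> dict:
    """Build request parameters for a deep-planning call."""
    if effort not in {"low", "medium", "high", "xhigh"}:
        raise ValueError(f"unsupported reasoning effort: {effort}")
    return {
        "model": "gpt-5.4",               # one model family...
        "reasoning": {"effort": effort},  # ...with effort as a dial on it
        "input": prompt,
    }

# The resulting parameters would then be handed to a client, e.g.:
#   client.responses.create(**planning_request("Draft plan.md from spec.md"))
```

Note what is deliberately absent: there is no string like "gpt-5.4-xhigh" anywhere in the request.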

5. Anthropic gives the strongest evidence for a planner/executor split

Anthropic’s current docs gave me the most interesting confirmation.

Their public Claude 4.6 docs describe:

  • Claude Opus 4.6 as their most intelligent model for building agents and coding
  • Claude Sonnet 4.6 as the best combination of speed and intelligence

But Claude Code’s own model configuration docs go further and document opusplan:

  • use Opus during plan mode
  • switch to Sonnet for execution

That is a real productized version of the same idea I was exploring manually.

Anthropic’s Claude Code GitHub Actions docs reinforce the same pattern in a simpler way:

  • GitHub Actions default to Sonnet
  • you explicitly opt into Opus 4.6 when you want the heavier model

So if you wanted one sentence of “proof” that multi-model phase routing is not a weird personal habit, this is probably the strongest one:

Anthropic already ships a planning/execution split as a named mode.
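The split that opusplan names is small enough to express in a few lines. This sketch mirrors the documented idea (the stronger model during plan mode, the faster model for execution); the model labels are this article's display names, not exact API model IDs.

```python
# A sketch of the planner/executor split that Claude Code names "opusplan":
# the stronger model handles plan mode, the faster model handles execution.
# Model labels mirror this article; they are not exact API model IDs.
PLANNING_PHASES = {"plan"}

def opusplan_style_model(phase: str) -> str:
    """Opus during plan mode, Sonnet for everything downstream."""
    return "Claude Opus 4.6" if phase in PLANNING_PHASES else "Claude Sonnet 4.6"

# Walking a feature through the split:
for phase in ("plan", "implement", "repair"):
    print(phase, "->", opusplan_style_model(phase))
```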

The Operational Tax of Doing This

This is the part that can get lost if the article only talks about model quality.

Routing phases across providers is not free. It adds:

  • more API-key and provider setup
  • more billing surfaces to monitor
  • more context handoff between phases
  • more workflow discipline so engineers know when to switch models
  • more debugging overhead when results are bad and the team has to ask whether the issue was the prompt, the phase, the model, or the handoff

That means I would only recommend a cross-provider stack like this when at least one of these is true:

  • planning quality is expensive to get wrong
  • the team already uses multiple providers
  • early-phase speed matters enough to justify the extra setup

If none of that is true, a simpler one-provider workflow will often beat a clever multi-model workflow in real life.

My Updated Recommendation

If I were standardizing this for a team today, I would update the original hypothesis to this:

Stage 1: Constitution, Specify, Clarify

Default:

  • Gemini 3 Flash Preview

Why:

  • fast enough to keep iteration tight
  • positioned by Google as a newer 3-series Flash model with Pro-level intelligence at Flash speed/pricing
  • good fit for turning repo evidence and product intent into clean first-draft artifacts

When I would go lighter:

  • use a lighter Flash-class option such as Gemini 2.5 Flash-Lite when cost and speed matter more than nuance

When I would go heavier:

  • use Gemini 3.1 Pro Preview if you want Google’s newer Pro-class option for harder clarification or repo synthesis passes
  • use GPT-5.4 mini or Claude Sonnet 4.6 if you want to stay closer to the provider you already trust operationally

Stage 2: Plan, Tasks, Analyze

Default:

  • GPT-5.4 with high or xhigh reasoning effort

Escalation:

  • GPT-5.4 pro when the architecture is unusually expensive to get wrong and extra latency is acceptable

Why:

  • these are the phases where deeper reasoning usually pays for itself
  • this is where artifact consistency matters most
  • this is where weak reasoning shows up later as rework, missing dependencies, or broken acceptance logic

One subtle but important correction:

  • I would describe this setup as GPT-5.4 at xhigh effort
  • I would not describe it as a model called gpt-5.4-xhigh

Stage 3: Implement

Default:

  • Claude Sonnet 4.6

Escalation:

  • Claude Opus 4.6 when the change is especially risky, architecture-heavy, or likely to require more autonomous repair loops

Why I changed my mind here:

  • my original instinct was “use Opus for implementation because it is the strongest coding model”
  • but Anthropic’s own Claude Code docs suggest a more nuanced operating model:
    • Opus for planning
    • Sonnet for execution
  • Anthropic’s GitHub Actions docs also default automated execution to Sonnet unless you explicitly switch to Opus 4.6

That does not mean Opus is a bad implementation model.

It means I would not treat Opus as the universal default executor anymore. I would treat it as the premium escalation path for harder codegen and harder repair loops.
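That escalation rule can be made explicit rather than left to gut feel. The following is a hedged heuristic sketch: the risk-signal names and the repair-loop threshold are illustrative choices of mine, not measured values from any vendor doc.

```python
# A hedged heuristic for the escalation rule above: default the implement
# phase to Sonnet, escalate to Opus only on risk signals. The signal names
# and the repair-loop threshold are illustrative, not measured.
RISK_SIGNALS = {"cross_cutting", "schema_migration", "auth_change"}

def implement_model(signals: set, repair_loops: int = 0) -> str:
    """Pick the implementation model from change-risk signals."""
    risky = bool(signals & RISK_SIGNALS) or repair_loops >= 2
    return "Claude Opus 4.6" if risky else "Claude Sonnet 4.6"
```

The useful property is that escalation becomes a reviewable decision: if a change keeps bouncing through repair loops on Sonnet, the rule itself says when to pay for Opus.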

How I Would Run This in Practice

Inside Cursor, I would make the routing visible instead of keeping it in my head.

I would create three modes: Spec Draft, Deep Plan, and Implement. Before detailing each one, here is a concrete example of the routing in action.

Imagine I am working on a brownfield SaaS repo and want to document, then extend, an existing invoice export flow:

  1. I would use Gemini 3 Flash Preview for constitution, specify, and clarify so I can quickly turn repo evidence and product notes into clean artifacts.
  2. I would switch to GPT-5.4 at high or xhigh for plan, tasks, and analyze so the architecture, dependency ordering, and consistency checks are stronger.
  3. I would hand implement to Claude Sonnet 4.6 for the default coding pass and only escalate to Claude Opus 4.6 if the implementation starts touching riskier cross-cutting areas than expected.

That is the kind of workflow I mean throughout this article: not model-switching for its own sake, but deliberate switching because the work itself changes shape.

1. Spec Draft

Use for:

  • /speckit.constitution
  • /speckit.specify
  • /speckit.clarify

Instruction focus:

  • stay grounded in repo evidence
  • optimize for clarity and flow
  • do not over-engineer

2. Deep Plan

Use for:

  • /speckit.plan
  • /speckit.tasks
  • /speckit.analyze

Instruction focus:

  • optimize for internal consistency
  • preserve requirements exactly
  • expose risks and dependencies explicitly

3. Implement

Use for:

  • /speckit.implement
  • repair loops after tests or runtime failures

Instruction focus:

  • make minimal, verifiable code changes
  • keep implementation aligned to spec.md, plan.md, and tasks.md
  • prefer passing tests and small diffs over cleverness

The important part is not the labels. The important part is that engineers can see:

  • which phase they are in
  • which model is expected there
  • why that model was chosen

That lowers drift a lot.

Alternatives I Would Seriously Consider

The goal is not to force one exact stack. The goal is to choose a stack that matches your team’s priorities.

If I had to reduce the decision down to one quick rubric, it would be:

If your priority is… | Start with…
lowest workflow complexity | Alternative 3: Maximum simplicity
easiest vendor consistency | Alternative 1: Anthropic-first
strongest planning depth | Alternative 2: OpenAI-heavy
latest Google Pro-class option | Alternative 5: Google-first updated
fastest early-phase drafting with deeper planning later | the default split in this article

Alternative 1: Anthropic-first and simpler

  • constitution, specify, clarify with Claude Sonnet 4.6
  • plan with Claude Opus 4.6
  • implement with Claude Sonnet 4.6

Best when:

  • you want less provider switching
  • you already use Claude Code heavily
  • you want something close to Anthropic’s documented opusplan philosophy

Alternative 2: OpenAI-heavy and planning-centric

  • early phases with GPT-5.4 mini
  • planning phases with GPT-5.4
  • hardest planning cases with GPT-5.4 pro

Best when:

  • you want to stay closer to one provider
  • your biggest pain is planning quality, not raw implementation speed
  • you are comfortable paying more latency for better artifact quality

Alternative 3: Maximum simplicity

  • one strong general model for every phase

Best when:

  • your team will not actually maintain a multi-model workflow
  • operational simplicity matters more than squeezing out phase-by-phase gains

This is the alternative I would choose for most teams before I choose a clever but brittle orchestration.

Alternative 4: OpenAI coding alternative outside the Cursor default

If you are building a broader coding workflow beyond just Cursor model selection, OpenAI’s current docs also position GPT-5.3-Codex as their most capable agentic coding model.

That makes it a real alternative for the execution side if your stack leans OpenAI more than Anthropic.

I would still separate:

  • drafting
  • planning
  • implementation

But the implementation slot does not have to belong to Claude if your tooling and costs point elsewhere.

Alternative 5: Google-first updated

  • constitution, specify, clarify with Gemini 3 Flash Preview
  • harder clarification or planning passes with Gemini 3.1 Pro Preview
  • implement with your preferred executor, usually Claude Sonnet 4.6 or another coding-first model

Best when:

  • you want to stay closer to Google’s current model family
  • you specifically want Google’s current Pro-class model in the stack
  • you like the default article logic of fast early drafting, but want a newer Google Pro-class escalation path

What Changed My Mind Most

Three things:

1. The hypothesis was right at the workflow level

The docs support the larger pattern:

  • fast model early
  • deeper model for planning
  • coding model for execution

That part is stronger after the review, not weaker.

2. The exact Google role mattered more than I expected

This was the sharpest clarification.

The useful Google split is not “one Gemini model everywhere.” It is:

  • Gemini 3 Flash Preview for speed-sensitive drafting
  • Gemini 3.1 Pro Preview for heavier Google-native passes

That is exactly why I now like dating these articles explicitly.

3. Anthropic’s opusplan is the cleanest alternative to manual routing

I expected to end up with a more custom conclusion.

Instead, Anthropic’s docs point to something simpler:

  • use the stronger model for planning
  • use the faster still-strong model for execution

That is a very good default pattern even if you never copy my exact stack.

My Final Take

If you want the shortest possible answer, it is this:

  • the idea behind the original hypothesis is correct
  • the exact model names should be updated for 2026

So the version I would actually hand to an engineering team today is:

  1. Gemini 3 Flash Preview for constitution, specify, and clarify
  2. GPT-5.4 at high or xhigh for plan, tasks, and analyze
  3. Claude Sonnet 4.6 for default implement, with Claude Opus 4.6 reserved for harder implementation cases

If the team wants the newest Google Pro-class option in the stack, I would use Gemini 3.1 Pro Preview alongside Gemini 3 Flash Preview rather than collapsing the whole Google side into one model.

And if the team wants something simpler and more vendor-consistent, I would look hard at an Anthropic-first setup built around the same principle Claude Code already documents in opusplan.

That is probably the most important lesson from this research:

the winning move is not one magical model. It is matching the model to the phase.

Source List