Skills vs Default Agents for Coding: Benchmarks and Practical Guidance

TL;DR

Skills are useful, but they are not magic. The best public evidence says narrow, curated, well-triggered skills can improve agent performance, while broad or poorly selected skills often behave like expensive prompt decoration.

For coding work in Claude Code and Cursor, my current recommendation is:

Use repo-level context (CLAUDE.md, AGENTS.md, .cursor/rules) for facts the agent should always know: stack, architecture, test commands, constraints, and “do not do this” guardrails.
Use skills for repeatable workflows with a clear entry point: code review, release notes, migration planning, API scaffolding, runbook execution, design-system usage, test-generation recipes.
Treat long agent conversations as workflow discovery. If you keep explaining the same procedure, extract it into a skill so future tasks do not repeat the same discovery loop.
Do not assume a skill helps because it feels organized. Benchmark it against a no-skill baseline.
Be especially careful with self-generated or marketplace skills. Several studies show that irrelevant, broad, or version-mismatched skills can provide zero gain or even reduce performance.

The dilemma is real, but the answer is not “skills vs no skills.” The answer is persistent context for stable rules, skills for task-specific procedures, and evals for proof.

What You Will Learn Here

What “skills” mean in Claude Code and Cursor.
What real benchmarks say about skill-augmented coding agents.
Why AGENTS.md or CLAUDE.md sometimes beats skills.
Where skills still make sense.
How to turn repeated agent chats into reusable procedural skills.
A practical decision framework for engineers and PMs.
A small code example for measuring skill lift in your own repo.

The Problem

AI coding agents are getting better even when you run them with no custom configuration. Claude Code and Cursor can already inspect files, run commands, edit code, use tools, and reason across a repository.

So the natural question is:

Should we invest in skills, or are default agents already good enough?

This question matters because skills are not free. They add maintenance, token overhead, trigger uncertainty, version drift, and sometimes security risk. But they can also encode workflow knowledge that a default model will not reliably infer.

Here is the useful mental model:

                 stable and always relevant
                           |
                           v
    CLAUDE.md / AGENTS.md / .cursor/rules
      architecture, conventions, constraints
                           |
                           v
                 coding agent session
                           |
                           v
               task-specific skill loads
       review workflow, migration recipe, template,
       domain checklist, release process, test harness

If the information should shape almost every task, it probably belongs in repo context. If it should activate only during a specific workflow, it is a skill candidate.

What Skills Are in Claude Code and Cursor

In Claude Code, a skill is a folder containing a SKILL.md file. Claude uses the skill description to decide when to load it, and the full skill content enters the conversation when invoked. A skill folder can also bundle supporting files such as templates, examples, reference docs, and scripts.

.claude/skills/code-review/
  SKILL.md
  checklist.md
  examples/
    good-review.md
  scripts/
    collect-diff.sh

Cursor added Agent Skills support in Cursor 2.4, released on January 22, 2026. Cursor describes skills as SKILL.md files that can include commands, scripts, and instructions for specializing agent behavior. Cursor also has a strong repo-context stack: .cursor/rules, user rules, and AGENTS.md.

The important product distinction:

Always-on context:
  Claude Code: CLAUDE.md
  Cursor: .cursor/rules, AGENTS.md, user rules

On-demand workflow context:
  Claude Code: .claude/skills/<name>/SKILL.md
  Cursor: Agent Skills / SKILL.md support in editor and CLI

What The Benchmarks Say

1. SkillsBench: skills help, but unevenly

SkillsBench is one of the clearest public benchmarks for agent skills. It tested skill-augmented agents across many tasks and found that curated skills improved average pass rate by 16.2 percentage points.

That headline is encouraging, but the details matter:

Software engineering improved only +4.5 percentage points in the reported domain breakdown.
Some tasks got worse: 16 of 84 tasks showed negative deltas.
Self-generated skills showed no average benefit.
Focused skills with a few modules beat giant comprehensive documentation bundles.

My read: skills are strongest when they contain procedural knowledge the model does not already have and when the task really needs that procedure.

2. SWE-Skills-Bench: software engineering gains are often small

SWE-Skills-Bench is especially relevant because it isolates software engineering tasks. The results are sobering:

39 of 49 skills produced zero pass-rate improvement.
The average gain was only +1.2%.
Seven specialized skills had meaningful gains, up to +30%.
Three skills degraded performance, up to -10%.
Token overhead sometimes rose dramatically without improving pass rate.

This is the strongest warning against treating skills as a universal coding upgrade. In software engineering, a skill needs a precise fit: right framework, right version, right task, right abstraction level.

3. Vercel’s eval: `AGENTS.md` beat skills for framework docs

Vercel tested a Next.js docs skill against an AGENTS.md docs index. The result was surprising:

| Configuration | Pass rate |
| --- | --- |
| Baseline, no docs | 53% |
| Skill, default behavior | 53% |
| Skill, explicit instruction | 79% |
| `AGENTS.md` docs index | 100% |

This does not prove skills are bad. It proves that skill invocation is a decision point, and agents can fail to make that decision. For framework documentation that should guide every relevant coding step, persistent repo context can outperform on-demand retrieval.

That pattern matters for Claude Code and Cursor users. If the agent must always know “we are on Next.js 16, use these APIs, avoid old patterns,” a compact AGENTS.md, CLAUDE.md, or .cursor/rules entry may beat a skill that the agent might forget to load.

4. Realistic skill retrieval is fragile

A later study on agentic skill usage in realistic settings tested retrieval from a large pool of real-world skills. It found that gains degrade when the agent has to search, select, and adapt skills by itself.

The same paper also showed a path forward: retrieval plus query-specific refinement improved Claude Opus 4.6 on Terminal-Bench 2.0 from 57.7% to 65.5%.

My read: skills work best when the agent is not merely handed a pile of files. It needs a good retrieval layer, good descriptions, and sometimes a refinement step that adapts the skill to the actual task.
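Benchmark write-ups mix absolute gains (percentage points) with relative gains (percent of the baseline), so it helps to compute both before comparing numbers across studies. Here is a minimal sketch using the Terminal-Bench figures quoted above. `describeLift` is a hypothetical helper of my own, not something taken from the paper.

```typescript
// Hypothetical helper: reports a pass-rate change in both common conventions.
type Lift = { absolutePoints: number; relativePercent: number };

function describeLift(baselinePct: number, treatedPct: number): Lift {
  if (baselinePct <= 0) {
    throw new Error("baseline pass rate must be positive to compute a relative change");
  }
  // Absolute change, in percentage points.
  const absolutePoints = treatedPct - baselinePct;
  // Relative change, as a percent of the baseline.
  const relativePercent = (absolutePoints / baselinePct) * 100;
  return { absolutePoints, relativePercent };
}

// Figures quoted above: 57.7% -> 65.5%.
const lift = describeLift(57.7, 65.5);
console.log(
  `${lift.absolutePoints.toFixed(1)} percentage points, ` +
    `${lift.relativePercent.toFixed(1)}% relative`
);
```

For those two figures this works out to roughly 7.8 percentage points, or about 13.5% relative to the baseline. When a report only says "+N%", check which of the two it means before comparing it with another benchmark.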

5. Repo-level rules are also not automatically good

The evidence on repo-level files is mixed too. One AGENTS.md study found that context files tended to reduce task success and increase inference cost. Another study found AGENTS.md reduced median runtime and output tokens while keeping task behavior comparable.

There is also a useful finding from rule-file research: negative constraints such as “do not refactor unrelated code” looked more beneficial than broad positive guidance like “follow good style.”

So the practical message is not “put everything in AGENTS.md.” It is:

Keep always-on files compact.
Prefer constraints and navigation over long lectures.
Do not encode every workflow globally.
Move procedural workflows into skills only when they are narrow and measurable.
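One way to make "keep always-on files compact" enforceable is a rough size check that runs in CI or a pre-commit hook. The sketch below is an illustration under stated assumptions: `checkContextBudget` is a hypothetical function, the four-characters-per-token ratio is only a crude heuristic, and both limits are arbitrary defaults you would tune for your team.

```typescript
// Hypothetical budget check for an always-on context file (CLAUDE.md, AGENTS.md, ...).
// ASSUMPTION: ~4 characters per token is a crude heuristic for English text;
// use your model's real tokenizer if you need accurate counts.
const CHARS_PER_TOKEN_ESTIMATE = 4;

type BudgetReport = {
  estimatedTokens: number;
  overBudget: boolean;
  longSections: string[]; // headings whose bodies exceed the per-section limit
};

function checkContextBudget(
  markdown: string,
  maxTokens = 1500, // ASSUMPTION: arbitrary budget
  maxSectionLines = 25 // ASSUMPTION: arbitrary per-section limit
): BudgetReport {
  const estimatedTokens = Math.ceil(markdown.length / CHARS_PER_TOKEN_ESTIMATE);
  const longSections: string[] = [];

  let heading = "(preamble)";
  let lineCount = 0;
  // Record the current section if it grew past the limit.
  const flush = () => {
    if (lineCount > maxSectionLines) longSections.push(heading);
  };

  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      // A new markdown heading closes the previous section.
      flush();
      heading = line.replace(/^#{1,6}\s+/, "").trim();
      lineCount = 0;
    } else if (line.trim() !== "") {
      lineCount += 1; // count only non-blank body lines
    }
  }
  flush();

  return { estimatedTokens, overBudget: estimatedTokens > maxTokens, longSections };
}
```

A long section is usually the signal that a workflow has crept into the global file and might be a skill candidate instead.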

Claude Code vs Cursor: Practical Differences

Claude Code and Cursor now both support the same general idea: reusable agent instructions plus task-specific skills. The difference is the developer experience.

| Question | Claude Code | Cursor |
| --- | --- | --- |
| Best always-on file | `CLAUDE.md` | `.cursor/rules` or `AGENTS.md` |
| Best skill use | terminal workflows, code review, project runbooks, subagent workflows | IDE workflows, domain procedures, editor/CLI agent specialization |
| Risk | skill stays in session context after invocation, token cost accumulates | skill may not trigger unless the task clearly matches |
| Strong default path | explicit slash command or automatic load by description | editor/CLI discovery plus rules-driven context |

For PMs, the important distinction is this: skills are not just “more context.” They are a productized workflow asset. A good skill captures how the team wants a task done, not just what the codebase looks like.

From Agent Chat to Reusable Skill

One of the most practical ways to find good skills is not to brainstorm a giant skill library upfront. It is to watch what happens in real agent sessions.

If a team repeatedly has to explain the same steps to an agent, or if a long debugging, review, migration, or release conversation produces a reliable procedure, that conversation is a signal. The useful move is to extract the stable procedure into a small skill.

The lifecycle looks like this:

agent chat
  -> repeated explanation or correction
  -> stable procedure emerges
  -> convert procedure into SKILL.md
  -> test on future similar tasks
  -> keep, trim, or remove based on results

This can reduce future token spend, but only in the specific sense that it may reduce repeated prompting, exploration, and correction. The skill itself still costs tokens when loaded. So the goal is not “add a skill to save tokens.” The goal is “avoid rediscovering the same workflow when a compact, well-triggered procedure would do.”

Good candidates sound like:

“Every time we ask for a release, we explain the same branch, changelog, tag, and verification steps.”
“Every API migration review turns into the same compatibility checklist.”
“Every design-system task needs the same component lookup and token rules.”
“Every incident write-up needs the same timeline, customer impact, root cause, and follow-up format.”

Bad candidates sound like:

“Make the agent code better.”
“Remember all our docs.”
“Use this huge archive of examples whenever it might be useful.”

Decision Framework

Use this simple rule:

Does the agent need this on almost every task?
  yes -> put it in CLAUDE.md / AGENTS.md / .cursor/rules
  no  -> continue

Is it a repeatable workflow with a clear trigger?
  yes -> make a skill
  no  -> use a one-off prompt or normal docs

Did the workflow emerge from repeated agent chats?
  yes -> extract the stable steps, not the whole transcript
  no  -> continue

Can you verify improvement with a small eval?
  yes -> keep and iterate
  no  -> treat it as experimental
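The rule above can also be written as a small function, which is handy if you want to triage a backlog of candidate instructions consistently. This is one possible reading of the framework, not a definitive encoding: all type and field names are hypothetical, and I am assuming the eval question is only applied to skill candidates.

```typescript
// Sketch of the decision rule above. All names here are hypothetical.
type Candidate = {
  neededOnAlmostEveryTask: boolean;
  repeatableWithClearTrigger: boolean;
  emergedFromRepeatedChats: boolean;
  verifiableWithSmallEval: boolean;
};

type Recommendation = {
  placement: "repo-context" | "skill" | "one-off-prompt-or-docs";
  // ASSUMPTION: the eval status is only tracked for skills in this sketch.
  status: "n/a" | "keep-and-iterate" | "experimental";
  notes: string[];
};

function recommend(c: Candidate): Recommendation {
  // Question 1: always-relevant facts belong in CLAUDE.md / AGENTS.md / .cursor/rules.
  if (c.neededOnAlmostEveryTask) {
    return { placement: "repo-context", status: "n/a", notes: [] };
  }
  // Question 2: without a repeatable workflow and a clear trigger, skip the skill.
  if (!c.repeatableWithClearTrigger) {
    return { placement: "one-off-prompt-or-docs", status: "n/a", notes: [] };
  }
  // Question 3: chat-derived workflows should be distilled, not pasted.
  const notes: string[] = [];
  if (c.emergedFromRepeatedChats) {
    notes.push("extract the stable steps, not the whole transcript");
  }
  // Question 4: a skill without a small eval stays experimental.
  return {
    placement: "skill",
    status: c.verifiableWithSmallEval ? "keep-and-iterate" : "experimental",
    notes,
  };
}

// Example: a release procedure that keeps coming up in chats but has no eval yet.
console.log(
  recommend({
    neededOnAlmostEveryTask: false,
    repeatableWithClearTrigger: true,
    emergedFromRepeatedChats: true,
    verifiableWithSmallEval: false,
  })
);
```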

A more concrete version:

| Use case | Better default |
| --- | --- |
| Project architecture and stack | repo context |
| “Never edit generated files” | repo context |
| Version-specific framework docs index | repo context, maybe linked docs |
| Pull request review checklist | skill |
| Database migration recipe | skill |
| Release process with commands | skill |
| Procedure discovered through repeated agent chats | skill, after a small eval |
| One-time feature request | default agent |
| Broad “write better code” guidance | probably neither; use tests, lint, review |

A Tiny Skill-Lift Eval

You do not need a giant benchmark to start. Create five to ten tasks your team actually asks agents to do, then run each task twice:

Default agent with normal repo context.
Same agent with the skill available or explicitly invoked.

Track pass/fail, tokens, elapsed time, and review comments.

For chat-derived skills, include at least one task where the default agent usually burns time rediscovering the procedure. The skill is worth keeping only if it reduces repeated explanation, correction, or review effort enough to justify the tokens it adds when loaded.

// One row per agent run on a single task.
type EvalResult = {
  taskId: string;
  mode: "default-agent" | "with-skill";
  passed: boolean;
  elapsedMs: number;
  inputTokens: number;
  outputTokens: number;
  reviewerNotes: string;
};

// Prints the pass rate and average total tokens for each mode.
function summarizeSkillLift(results: EvalResult[]) {
  // Group runs by mode so the two configurations can be compared side by side.
  const byMode = new Map<EvalResult["mode"], EvalResult[]>();

  for (const result of results) {
    byMode.set(result.mode, [...(byMode.get(result.mode) ?? []), result]);
  }

  for (const [mode, rows] of byMode) {
    // rows is never empty: a mode only enters the map once it has a run.
    const passRate = rows.filter((row) => row.passed).length / rows.length;
    const avgTokens =
      rows.reduce((sum, row) => sum + row.inputTokens + row.outputTokens, 0) /
      rows.length;

    console.log(`${mode}: ${(passRate * 100).toFixed(1)}% pass, ${avgTokens.toFixed(0)} avg tokens`);
  }
}

What you want is not “the skill was used.” You want evidence that the skill improved outcomes enough to justify its maintenance.
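Aggregate pass rates can hide individual tasks that the skill made worse, which is exactly the kind of negative delta the public benchmarks report. A per-task comparison makes those cases visible. The sketch below is a hypothetical companion to the summary above: `pairedLift` and its trimmed-down `Run` type are my own names, and it assumes one run per task per mode (a later run for the same task and mode overwrites an earlier one).

```typescript
// Hypothetical per-task comparison between the two eval modes.
type Run = {
  taskId: string;
  mode: "default-agent" | "with-skill";
  passed: boolean;
  totalTokens: number;
};

type TaskDelta = {
  taskId: string;
  outcome: "fixed-by-skill" | "broken-by-skill" | "unchanged";
  extraTokens: number; // tokens with the skill minus tokens without it
};

function pairedLift(runs: Run[]): TaskDelta[] {
  // Index runs by task, keeping one run per mode.
  const byTask = new Map<string, Partial<Record<Run["mode"], Run>>>();
  for (const run of runs) {
    const entry = byTask.get(run.taskId) ?? {};
    entry[run.mode] = run;
    byTask.set(run.taskId, entry);
  }

  const deltas: TaskDelta[] = [];
  for (const [taskId, pair] of byTask) {
    const base = pair["default-agent"];
    const skill = pair["with-skill"];
    if (!base || !skill) continue; // skip tasks that were not run in both modes

    let outcome: TaskDelta["outcome"] = "unchanged";
    if (base.passed !== skill.passed) {
      outcome = skill.passed ? "fixed-by-skill" : "broken-by-skill";
    }
    deltas.push({ taskId, outcome, extraTokens: skill.totalTokens - base.totalTokens });
  }
  return deltas;
}
```

Any `broken-by-skill` row deserves a look at the transcript before the skill ships. An `unchanged` row with a large `extraTokens` value is the "expensive prompt decoration" case.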

A Good Skill Shape

A good coding skill is narrow, navigable, and testable.

---
name: api-migration-review
description: Use when reviewing an API migration PR that changes routes, schemas, auth, or backward compatibility.
---

## Goal

Review an API migration for compatibility, safety, and rollout risk.

## Checklist

1. Identify changed routes and request/response schemas.
2. Check whether existing clients can still call the old contract.
3. Verify auth and permission changes.
4. Look for database migration coupling.
5. Confirm tests cover old and new behavior.

## Output

Return:
- blocking issues
- non-blocking concerns
- missing tests
- rollout risk: low, medium, or high

Notice what this skill does not do. It does not teach the whole backend architecture. It does not include every API guideline the company has ever written. It gives the agent a repeatable procedure for a specific review job.
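"Narrow and navigable" can be partly checked by a script. The sketch below lints a SKILL.md string for the shape shown above. It is a hypothetical helper under several assumptions: it only understands simple `key: value` frontmatter like the example (it is not a YAML parser), the "Use when" trigger check is my own convention rather than a rule from either product, and the body-length limit is arbitrary.

```typescript
// Hypothetical lint for a SKILL.md string. Not a full YAML parser.
type SkillLint = {
  name: string | undefined;
  description: string | undefined;
  problems: string[];
};

function lintSkill(skillMd: string, maxBodyLines = 60): SkillLint {
  // Split the leading frontmatter block (between --- lines) from the body.
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(skillMd);
  if (!match) {
    return {
      name: undefined,
      description: undefined,
      problems: ["missing frontmatter block delimited by ---"],
    };
  }

  // Read simple `key: value` pairs, splitting on the first colon only.
  const fields = new Map<string, string>();
  for (const line of match[1].split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx > 0) fields.set(line.slice(0, idx).trim(), line.slice(idx + 1).trim());
  }

  const problems: string[] = [];
  const name = fields.get("name");
  const description = fields.get("description");
  if (!name) problems.push("missing name");
  if (!description) {
    problems.push("missing description");
  } else if (!/^use when\b/i.test(description)) {
    // ASSUMPTION: descriptions that start with "Use when" state a clear trigger.
    problems.push("description does not state a trigger (for example 'Use when ...')");
  }

  // ASSUMPTION: maxBodyLines is an arbitrary cap to keep the skill narrow.
  const bodyLines = match[2].split(/\r?\n/).filter((l) => l.trim() !== "").length;
  if (bodyLines > maxBodyLines) {
    problems.push(`body has ${bodyLines} non-blank lines, limit is ${maxBodyLines}`);
  }

  return { name, description, problems };
}
```

A lint like this cannot tell you whether a skill helps. It only catches the cheap failures (no trigger, no name, sprawling body) before you spend eval budget on it.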

The Article’s Research Audit

Here is the honest synthesis:

Evidence for skills: SkillsBench shows meaningful average lift from curated skills, and realistic retrieval plus refinement can improve Terminal-Bench performance.
Evidence against naive skills: SWE-Skills-Bench shows small average gains for software engineering, many zero-lift skills, and some harmful skills.
Evidence for always-on context: Vercel’s Next.js eval showed a compact AGENTS.md docs index outperforming skill-based retrieval.
Evidence against bloated always-on context: AGENTS.md studies disagree, with some showing reduced task success or increased cost.
Evidence for local evals: Anthropic’s own skill-creator updates emphasize testing, benchmarking, A/B comparison, token usage, and regression checks.

The pattern is consistent enough to act on:

Default agent
  good for: general coding, exploration, one-off changes

Repo context
  good for: stable constraints, architecture, commands, version facts

Skills
  good for: narrow procedures, repeated workflows, tool-backed actions

Evals
  good for: deciding whether any of the above actually helped

Current Gaps

I found a few gaps worth calling out:

Public, controlled benchmarks specifically for Cursor skills vs Cursor without skills are still limited. Cursor has strong agent eval work through CursorBench, but that research is about model and agent quality, not skill lift specifically.
Many public skill benchmarks test curated conditions. Real teams need messy tests: stale docs, duplicated rules, bad skill descriptions, version drift, and multiple skills competing for the same task.
We need more PM-friendly metrics: not only pass rate, but review time saved, defect rate, rework count, and how often the agent asks better questions.
Skill security deserves more attention. A skill can include scripts and operational instructions, so it should be reviewed like code.
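On the security point, a cheap first step is to make sure executable files inside skill folders are routed to code review. The sketch below is illustrative only: `filesNeedingCodeReview` is a hypothetical helper, the extension list is my assumption and far from exhaustive, and it works on a list of relative paths so that it stays independent of any particular filesystem API.

```typescript
// ASSUMPTION: illustrative, non-exhaustive list of script-like extensions.
const EXECUTABLE_EXTENSIONS = [".sh", ".bash", ".py", ".js", ".ts", ".rb", ".ps1"];

// Returns the skill files that look executable and should be reviewed like code.
function filesNeedingCodeReview(skillFiles: string[]): string[] {
  return skillFiles.filter((path) => {
    const lower = path.toLowerCase();
    const hasScriptExtension = EXECUTABLE_EXTENSIONS.some((ext) => lower.endsWith(ext));
    // Anything under a scripts/ directory is flagged regardless of extension.
    const inScriptsDir = lower.split("/").slice(0, -1).includes("scripts");
    return hasScriptExtension || inScriptsDir;
  });
}

// Paths mirror the example skill layout shown earlier in the article.
console.log(
  filesNeedingCodeReview([
    "SKILL.md",
    "checklist.md",
    "examples/good-review.md",
    "scripts/collect-diff.sh",
  ])
);
```

This only highlights executable files. The instructions in SKILL.md itself can also direct the agent to run commands, so they still deserve a human read.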

Final Recommendation

For Claude Code and Cursor teams, I would start with this operating model:

Write a compact repo context file first.
Watch real agent sessions for repeated explanation loops.
Add skills only for repeated workflows that the team can name.
Keep each skill small enough that the agent can actually follow it.
Prefer skills with scripts, examples, or checklists over vague advice.
Run a tiny A/B eval before standardizing a skill across the team.

The best teams will not ask “should we use skills?” They will ask:

Which part of our agent workflow needs persistent context, which part needs a reusable procedure, and which part needs an executable check?

That is where skills become useful: not as decoration, but as measured workflow infrastructure.

Luis Mori Guerra
