Skills vs Default Agents: Do Coding Skills Really Improve Claude Code and Cursor?

A practical research audit for engineers and PMs: when agent skills improve coding agents, when default agents plus repo context are enough, and how to evaluate the tradeoff.

TL;DR

Skills are useful, but they are not magic. The best public evidence says narrow, curated, well-triggered skills can improve agent performance, while broad or poorly selected skills often behave like expensive prompt decoration.

For coding work in Claude Code and Cursor, my current recommendation is:

  • Use repo-level context (CLAUDE.md, AGENTS.md, .cursor/rules) for facts the agent should always know: stack, architecture, test commands, constraints, and “do not do this” guardrails.
  • Use skills for repeatable workflows with a clear entry point: code review, release notes, migration planning, API scaffolding, runbook execution, design-system usage, test-generation recipes.
  • Do not assume a skill helps because it feels organized. Benchmark it against a no-skill baseline.
  • Be especially careful with self-generated or marketplace skills. Several studies show that irrelevant, broad, or version-mismatched skills can provide zero gain or even reduce performance.

The dilemma is real, but the answer is not “skills vs no skills.” The answer is persistent context for stable rules, skills for task-specific procedures, and evals for proof.

What You Will Learn Here

  • What “skills” mean in Claude Code and Cursor.
  • What real benchmarks say about skill-augmented coding agents.
  • Why AGENTS.md or CLAUDE.md sometimes beats skills.
  • Where skills still make sense.
  • A practical decision framework for engineers and PMs.
  • A small code example for measuring skill lift in your own repo.

The Problem

AI coding agents are getting better even when you run them with no custom configuration. Claude Code and Cursor can already inspect files, run commands, edit code, use tools, and reason across a repository.

So the natural question is:

Should we invest in skills, or are default agents already good enough?

This question matters because skills are not free. They add maintenance, token overhead, trigger uncertainty, version drift, and sometimes security risk. But they can also encode workflow knowledge that a default model will not reliably infer.

Here is the useful mental model:

                 stable and always relevant
                           |
                           v
    CLAUDE.md / AGENTS.md / .cursor/rules
      architecture, conventions, constraints
                           |
                           v
                 coding agent session
                           |
                           v
               task-specific skill loads
       review workflow, migration recipe, template,
       domain checklist, release process, test harness

If the information should shape almost every task, it probably belongs in repo context. If it should activate only during a specific workflow, it is a skill candidate.

What Skills Are in Claude Code and Cursor

In Claude Code, a skill is a folder containing a SKILL.md file. Claude uses the skill description to decide when to load it, and the full skill content enters the conversation when the skill is invoked. A skill folder can also hold supporting files such as templates, examples, reference docs, and scripts.

.claude/skills/code-review/
  SKILL.md
  checklist.md
  examples/
    good-review.md
  scripts/
    collect-diff.sh

Cursor added Agent Skills support in Cursor 2.4, released on January 22, 2026. Cursor describes skills as SKILL.md files that can include commands, scripts, and instructions for specializing agent behavior. Cursor also has a strong repo-context stack: .cursor/rules, user rules, and AGENTS.md.

The important product distinction:

Always-on context:
  Claude Code: CLAUDE.md
  Cursor: .cursor/rules, AGENTS.md, user rules

On-demand workflow context:
  Claude Code: .claude/skills/<name>/SKILL.md
  Cursor: Agent Skills / SKILL.md support in editor and CLI

What The Benchmarks Say

1. SkillsBench: skills help, but unevenly

SkillsBench is one of the clearest public benchmarks for agent skills. It tested skill-augmented agents across many tasks and found that curated skills improved average pass rate by 16.2 percentage points.

That headline is encouraging, but the details matter:

  • Software engineering improved only +4.5 percentage points in the reported domain breakdown.
  • Some tasks got worse: 16 of 84 tasks showed negative deltas.
  • Self-generated skills showed no average benefit.
  • Focused skills with a few modules beat giant comprehensive documentation bundles.

My read: skills are strongest when they contain procedural knowledge the model does not already have and when the task really needs that procedure.

2. SWE-Skills-Bench: software engineering gains are often small

SWE-Skills-Bench is especially relevant because it isolates software engineering tasks. The results are sobering:

  • 39 of 49 skills produced zero pass-rate improvement.
  • The average gain was only +1.2%.
  • Seven specialized skills had meaningful gains, up to +30%.
  • Three skills degraded performance, up to -10%.
  • Token overhead sometimes rose dramatically without improving pass rate.

This is the strongest warning against treating skills as a universal coding upgrade. In software engineering, a skill needs a precise fit: right framework, right version, right task, right abstraction level.
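As a quick sanity check on the counts in the bullets above, here is a minimal sketch that tallies the three reported outcome groups and expresses each as a share of the 49 skills. The counts come from the bullets; the `OutcomeGroup` type and `shareOf` helper are illustrative names of my own, and the tally assumes the three groups do not overlap.

```typescript
// Outcome counts as reported in the bullets above (SWE-Skills-Bench).
// Assumption: the three groups are disjoint.
type OutcomeGroup = { label: string; skills: number };

const totalSkills = 49;
const groups: OutcomeGroup[] = [
  { label: "zero pass-rate improvement", skills: 39 },
  { label: "meaningful gain", skills: 7 },
  { label: "degraded performance", skills: 3 },
];

// Share of all evaluated skills, as a percentage rounded to one decimal place.
function shareOf(count: number, total: number): number {
  return Math.round((count / total) * 1000) / 10;
}

// Sum of the three groups, to confirm they cover every evaluated skill.
const accountedFor = groups.reduce((sum, g) => sum + g.skills, 0);

for (const g of groups) {
  console.log(`${g.label}: ${g.skills}/${totalSkills} (${shareOf(g.skills, totalSkills)}%)`);
}
console.log(`accounted for: ${accountedFor}/${totalSkills}`);
```

Under that assumption the three groups add up to all 49 skills, and roughly four in five skills changed nothing.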

3. Vercel’s eval: AGENTS.md beat skills for framework docs

Vercel tested a Next.js docs skill against an AGENTS.md docs index. The result was surprising:

Configuration                  Pass rate
Baseline, no docs              53%
Skill, default behavior        53%
Skill, explicit instruction    79%
AGENTS.md docs index           100%

This does not prove skills are bad. It shows that skill invocation is a decision point, and agents can fail to make that decision. For framework documentation that should guide every relevant coding step, persistent repo context can outperform on-demand retrieval.

That pattern is very important for Claude Code and Cursor users. If the agent must always know “we are on Next.js 16, use these APIs, avoid old patterns,” a compact AGENTS.md, CLAUDE.md, or .cursor/rules entry may beat a skill that the agent might forget to load.
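To make the Vercel comparison concrete, the sketch below computes each configuration's lift over the baseline in percentage points. The pass rates are the ones reported above; the `Config` type and `liftPoints` function are illustrative names, not part of Vercel's eval.

```typescript
// Pass rates (percent) as reported in the Vercel results above.
type Config = { name: string; passRate: number };

const baseline: Config = { name: "Baseline, no docs", passRate: 53 };
const configs: Config[] = [
  { name: "Skill, default behavior", passRate: 53 },
  { name: "Skill, explicit instruction", passRate: 79 },
  { name: "AGENTS.md docs index", passRate: 100 },
];

// Absolute lift in percentage points: the plain difference of two pass rates.
function liftPoints(candidate: Config, base: Config): number {
  return candidate.passRate - base.passRate;
}

for (const c of configs) {
  const lift = liftPoints(c, baseline);
  console.log(`${c.name}: ${lift >= 0 ? "+" : ""}${lift} points vs baseline`);
}
```

The zero lift for the default-behavior skill is consistent with the agent not loading it. The explicit instruction recovers part of the gap, and the always-on index closes it.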

4. Realistic skill retrieval is fragile

A later study on agentic skill usage in realistic settings tested retrieval from a large pool of real-world skills. It found that gains degrade when the agent has to search, select, and adapt skills by itself.

The same paper also showed a path forward: retrieval plus query-specific refinement improved Claude Opus 4.6 on Terminal-Bench 2.0 from 57.7% to 65.5%.

My read: skills work best when the agent is not merely handed a pile of files. It needs a good retrieval layer, good descriptions, and sometimes a refinement step that adapts the skill to the actual task.
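To show why descriptions matter so much for selection, here is a deliberately naive toy: it picks a skill by counting keywords shared between the task and each skill description. This is not how Claude Code, Cursor, or the paper's retrieval layer select skills. The `SkillEntry` type, the scoring rule, and the sample pool are all assumptions made up for illustration.

```typescript
// Toy illustration only: real agents do not necessarily select skills this way.
type SkillEntry = { name: string; description: string };

// Unique lowercase words in a piece of text, ignoring words of 1-2 characters.
function keywords(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return words.filter((w, i) => w.length > 2 && words.indexOf(w) === i);
}

// Score = number of task keywords that also appear in the skill description.
function overlapScore(task: string, skill: SkillEntry): number {
  const described = keywords(skill.description);
  return keywords(task).filter((w) => described.indexOf(w) !== -1).length;
}

// Pick the best-scoring skill, or null when nothing overlaps at all.
function selectSkill(task: string, pool: SkillEntry[]): SkillEntry | null {
  let best: SkillEntry | null = null;
  let bestScore = 0;
  for (const skill of pool) {
    const score = overlapScore(task, skill);
    if (score > bestScore) {
      best = skill;
      bestScore = score;
    }
  }
  return best;
}

// Hypothetical pool: one vague description, one with a clear trigger.
const pool: SkillEntry[] = [
  { name: "helper", description: "Useful things for backend work." },
  {
    name: "api-migration-review",
    description:
      "Use when reviewing an API migration PR that changes routes, schemas, auth, or backward compatibility.",
  },
];

const chosen = selectSkill("Review this PR: it changes auth on two API routes", pool);
console.log(chosen ? chosen.name : "no skill selected");
```

Even in this toy, the vaguely described skill never wins, and a task that shares no vocabulary with any description selects nothing. A real retrieval layer is more capable, but it still only has the description to go on.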

5. Repo-level rules are also not automatically good

The evidence on repo-level files is mixed too. One AGENTS.md study found that context files tended to reduce task success and increase inference cost. Another study found AGENTS.md reduced median runtime and output tokens while keeping task behavior comparable.

There is also a useful finding from rule-file research: negative constraints such as “do not refactor unrelated code” looked more beneficial than broad positive guidance like “follow good style.”

So the practical message is not “put everything in AGENTS.md.” It is:

  • Keep always-on files compact.
  • Prefer constraints and navigation over long lectures.
  • Do not encode every workflow globally.
  • Move procedural workflows into skills only when they are narrow and measurable.
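One way to act on "keep always-on files compact" is a size check in CI. The sketch below is a rough guard, not a real tokenizer: it assumes about four characters per token, and the 2,000-token budget in the example is a hypothetical number a team would choose for itself. Neither figure comes from the studies above.

```typescript
// Assumption: ~4 characters per token is a crude estimate, not a real tokenizer.
const CHARS_PER_TOKEN = 4;

type ContextFile = { path: string; content: string };
type BudgetReport = { path: string; estimatedTokens: number; overBudget: boolean };

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Flag always-on files whose estimated size exceeds a team-chosen budget.
function checkContextBudget(files: ContextFile[], maxTokens: number): BudgetReport[] {
  return files.map((file) => {
    const estimatedTokens = estimateTokens(file.content);
    return { path: file.path, estimatedTokens, overBudget: estimatedTokens > maxTokens };
  });
}

// Example with inline strings standing in for real file contents.
const budgetReport = checkContextBudget(
  [
    { path: "CLAUDE.md", content: "Run tests with npm test. Never edit generated files." },
    { path: "AGENTS.md", content: new Array(12001).join("x") }, // 12,000 characters of filler
  ],
  2000 // hypothetical budget
);

for (const row of budgetReport) {
  console.log(`${row.path}: ~${row.estimatedTokens} tokens${row.overBudget ? " (over budget)" : ""}`);
}
```

A check like this says nothing about whether the content is good. It only catches the slow drift toward a long lecture.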

Claude Code vs Cursor: Practical Differences

Claude Code and Cursor now both support the same general idea: reusable agent instructions plus task-specific skills. The difference is the developer experience.

Best always-on file:
  Claude Code: CLAUDE.md
  Cursor: .cursor/rules or AGENTS.md

Best skill use:
  Claude Code: terminal workflows, code review, project runbooks, subagent workflows
  Cursor: IDE workflows, domain procedures, editor/CLI agent specialization

Risk:
  Claude Code: skill stays in session context after invocation, token cost accumulates
  Cursor: skill may not trigger unless the task clearly matches

Strong default path:
  Claude Code: explicit slash command or automatic load by description
  Cursor: editor/CLI discovery plus rules-driven context

For PMs, the important distinction is this: skills are not just “more context.” They are a productized workflow asset. A good skill captures how the team wants a task done, not just what the codebase looks like.

Decision Framework

Use this simple rule:

Does the agent need this on almost every task?
  yes -> put it in CLAUDE.md / AGENTS.md / .cursor/rules
  no  -> continue

Is it a repeatable workflow with a clear trigger?
  yes -> make a skill
  no  -> use a one-off prompt or normal docs

Can you verify improvement with a small eval?
  yes -> keep and iterate
  no  -> treat it as experimental
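The rule above can be written down as a small function, which makes it easy to apply to a backlog of candidate instructions. This is a sketch under one assumption of mine: the third question (can you verify it?) applies to whichever placement the first two questions produce. The type and function names are illustrative.

```typescript
// The three questions from the decision rule above, as plain booleans.
type Candidate = {
  neededOnAlmostEveryTask: boolean;
  repeatableWithClearTrigger: boolean;
  verifiableWithSmallEval: boolean;
};

type Placement = "repo-context" | "skill" | "one-off-prompt-or-docs";
type Status = "keep-and-iterate" | "experimental";
type Decision = { placement: Placement; status: Status };

function decide(c: Candidate): Decision {
  // Questions 1 and 2 choose where the instruction lives.
  let placement: Placement;
  if (c.neededOnAlmostEveryTask) {
    placement = "repo-context";
  } else if (c.repeatableWithClearTrigger) {
    placement = "skill";
  } else {
    placement = "one-off-prompt-or-docs";
  }
  // Question 3 decides how much to trust it (assumed to apply to any placement).
  const status: Status = c.verifiableWithSmallEval ? "keep-and-iterate" : "experimental";
  return { placement: placement, status: status };
}

// Example: a pull request review checklist with a small eval behind it.
const prReview = decide({
  neededOnAlmostEveryTask: false,
  repeatableWithClearTrigger: true,
  verifiableWithSmallEval: true,
});
console.log(prReview.placement, prReview.status);
```

Note the ordering: an instruction needed on almost every task goes to repo context even if it is also a repeatable workflow.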

A more concrete version:

Use case                                   Better default
Project architecture and stack             repo context
“Never edit generated files”               repo context
Version-specific framework docs index      repo context, maybe linked docs
Pull request review checklist              skill
Database migration recipe                  skill
Release process with commands              skill
One-time feature request                   default agent
Broad “write better code” guidance         probably neither; use tests, lint, review

A Tiny Skill-Lift Eval

You do not need a giant benchmark to start. Create five to ten tasks your team actually asks agents to do, then run each task twice:

  1. Default agent with normal repo context.
  2. Same agent with the skill available or explicitly invoked.

Track pass/fail, tokens, elapsed time, and review comments.

// One row per agent run: a single task executed in a single mode.
type EvalResult = {
  taskId: string;
  mode: "default-agent" | "with-skill";
  passed: boolean;
  elapsedMs: number;
  inputTokens: number;
  outputTokens: number;
  reviewerNotes: string;
};

// Prints pass rate and average token usage for each mode.
function summarizeSkillLift(results: EvalResult[]): void {
  // Group runs by mode so each configuration is summarized separately.
  const byMode = new Map<EvalResult["mode"], EvalResult[]>();

  for (const result of results) {
    byMode.set(result.mode, [...(byMode.get(result.mode) ?? []), result]);
  }

  for (const [mode, rows] of byMode) {
    // Every group holds at least one row, so rows.length is never zero here.
    const passRate = rows.filter((row) => row.passed).length / rows.length;
    // Average total tokens (input plus output) per run.
    const avgTokens =
      rows.reduce((sum, row) => sum + row.inputTokens + row.outputTokens, 0) /
      rows.length;

    console.log(`${mode}: ${(passRate * 100).toFixed(1)}% pass, ${avgTokens.toFixed(0)} avg tokens`);
  }
}

What you want is not “the skill was used.” You want evidence that the skill improved outcomes enough to justify its maintenance.
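Per-mode averages hide which tasks the skill actually changed. A paired comparison, sketched below, matches each task's default run with its skill run and counts fixes, regressions, and token overhead. It assumes one run per task per mode. The `SkillRun` type is a subset of the `EvalResult` fields above, the other names are my own, and the sample data is made up.

```typescript
// Subset of the EvalResult fields above; any EvalResult value fits this shape.
type SkillRun = {
  taskId: string;
  mode: "default-agent" | "with-skill";
  passed: boolean;
  inputTokens: number;
  outputTokens: number;
};

type LiftSummary = {
  pairedTasks: number;
  fixedBySkill: number; // failed by default, passed with the skill
  brokenBySkill: number; // passed by default, failed with the skill
  liftPoints: number; // pass-rate difference in percentage points
  tokenRatio: number; // with-skill tokens divided by default tokens
};

// Assumption: one run per task per mode (a later duplicate overwrites an earlier one).
function pairedSkillLift(runs: SkillRun[]): LiftSummary {
  const base: Record<string, SkillRun> = {};
  const skill: Record<string, SkillRun> = {};
  for (const run of runs) {
    (run.mode === "default-agent" ? base : skill)[run.taskId] = run;
  }
  // Only tasks that ran in both modes are comparable.
  const taskIds = Object.keys(base).filter((id) => id in skill);
  let fixed = 0;
  let broken = 0;
  let baseTokens = 0;
  let skillTokens = 0;
  for (const id of taskIds) {
    const b = base[id];
    const s = skill[id];
    if (!b.passed && s.passed) fixed++;
    if (b.passed && !s.passed) broken++;
    baseTokens += b.inputTokens + b.outputTokens;
    skillTokens += s.inputTokens + s.outputTokens;
  }
  return {
    pairedTasks: taskIds.length,
    fixedBySkill: fixed,
    brokenBySkill: broken,
    liftPoints: taskIds.length ? ((fixed - broken) / taskIds.length) * 100 : 0,
    tokenRatio: baseTokens ? skillTokens / baseTokens : 0,
  };
}

// Made-up sample data: skill runs use 1,500 tokens, default runs 1,000.
function sampleRun(taskId: string, mode: SkillRun["mode"], passed: boolean): SkillRun {
  const withSkill = mode === "with-skill";
  return {
    taskId: taskId,
    mode: mode,
    passed: passed,
    inputTokens: withSkill ? 1200 : 800,
    outputTokens: withSkill ? 300 : 200,
  };
}

const liftSummary = pairedSkillLift([
  sampleRun("t1", "default-agent", false), sampleRun("t1", "with-skill", true),
  sampleRun("t2", "default-agent", true), sampleRun("t2", "with-skill", true),
  sampleRun("t3", "default-agent", true), sampleRun("t3", "with-skill", false),
  sampleRun("t4", "default-agent", false), sampleRun("t4", "with-skill", true),
]);
console.log(liftSummary);
```

With five to ten tasks the numbers are noisy, so I would read `fixedBySkill` and `brokenBySkill` task by task rather than trust the headline lift.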

A Good Skill Shape

A good coding skill is narrow, navigable, and testable.

---
name: api-migration-review
description: Use when reviewing an API migration PR that changes routes, schemas, auth, or backward compatibility.
---

## Goal

Review an API migration for compatibility, safety, and rollout risk.

## Checklist

1. Identify changed routes and request/response schemas.
2. Check whether existing clients can still call the old contract.
3. Verify auth and permission changes.
4. Look for database migration coupling.
5. Confirm tests cover old and new behavior.

## Output

Return:
- blocking issues
- non-blocking concerns
- missing tests
- rollout risk: low, medium, or high

Notice what this skill does not do. It does not teach the whole backend architecture. It does not include every API guideline the company has ever written. It gives the agent a repeatable procedure for a specific review job.
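"Narrow, navigable, and testable" can be partly enforced with a lint step. The sketch below reads the frontmatter of a SKILL.md string and applies a few checks. The checks are hypothetical team conventions (a name, a description that starts with "Use when", a cap on body length), not requirements of Claude Code or Cursor. The frontmatter reader handles only flat `key: value` lines and is not a YAML parser.

```typescript
type SkillCheck = { ok: boolean; problems: string[] };
type ParsedSkill = { fields: Record<string, string>; body: string };

// Minimal frontmatter reader: flat `key: value` lines between two `---` markers.
function readFrontmatter(source: string): ParsedSkill | null {
  const lines = source.split("\n");
  if (lines[0].trim() !== "---") return null;
  let end = -1;
  for (let i = 1; i < lines.length; i++) {
    if (lines[i].trim() === "---") {
      end = i;
      break;
    }
  }
  if (end === -1) return null;
  const fields: Record<string, string> = {};
  for (const line of lines.slice(1, end)) {
    const colon = line.indexOf(":");
    if (colon > 0) fields[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }
  return { fields: fields, body: lines.slice(end + 1).join("\n") };
}

// Hypothetical team conventions, not requirements of any tool.
function checkSkillShape(source: string, maxBodyLines: number): SkillCheck {
  const parsed = readFrontmatter(source);
  if (!parsed) return { ok: false, problems: ["missing frontmatter block"] };
  const problems: string[] = [];
  if (!parsed.fields["name"]) problems.push("missing name");
  const description = parsed.fields["description"] || "";
  if (!description) {
    problems.push("missing description");
  } else if (description.toLowerCase().indexOf("use when") !== 0) {
    problems.push("description does not start with a trigger such as 'Use when ...'");
  }
  // Count non-empty body lines as a rough proxy for how much the agent must follow.
  const bodyLines = parsed.body.split("\n").filter((l) => l.trim() !== "").length;
  if (bodyLines > maxBodyLines) {
    problems.push(`body has ${bodyLines} non-empty lines, over the limit of ${maxBodyLines}`);
  }
  return { ok: problems.length === 0, problems: problems };
}

// A shortened copy of the example skill above.
const exampleSkill = [
  "---",
  "name: api-migration-review",
  "description: Use when reviewing an API migration PR that changes routes, schemas, auth, or backward compatibility.",
  "---",
  "",
  "## Goal",
  "",
  "Review an API migration for compatibility, safety, and rollout risk.",
].join("\n");

console.log(checkSkillShape(exampleSkill, 40));
```

A lint like this only checks shape. Whether the skill helps is still a question for the A/B eval.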

The Article’s Research Audit

Here is the honest synthesis:

  • Evidence for skills: SkillsBench shows meaningful average lift from curated skills, and realistic retrieval plus refinement can improve Terminal-Bench performance.
  • Evidence against naive skills: SWE-Skills-Bench shows small average gains for software engineering, many zero-lift skills, and some harmful skills.
  • Evidence for always-on context: Vercel’s Next.js eval showed a compact AGENTS.md docs index outperforming skill-based retrieval.
  • Evidence against bloated always-on context: AGENTS.md studies disagree, with some showing reduced task success or increased cost.
  • Evidence for local evals: Anthropic’s own skill-creator updates emphasize testing, benchmarking, A/B comparison, token usage, and regression checks.

The pattern is consistent enough to act on:

Default agent
  good for: general coding, exploration, one-off changes

Repo context
  good for: stable constraints, architecture, commands, version facts

Skills
  good for: narrow procedures, repeated workflows, tool-backed actions

Evals
  good for: deciding whether any of the above actually helped

Current Gaps

I found a few gaps worth calling out:

  • Public, controlled benchmarks specifically for Cursor skills vs Cursor without skills are still limited. Cursor has strong agent eval work through CursorBench, but that research is about model and agent quality, not skill lift specifically.
  • Many public skill benchmarks test curated conditions. Real teams need messy tests: stale docs, duplicated rules, bad skill descriptions, version drift, and multiple skills competing for the same task.
  • We need more PM-friendly metrics: not only pass rate, but review time saved, defect rate, rework count, and how often the agent asks better questions.
  • Skill security deserves more attention. A skill can include scripts and operational instructions, so it should be reviewed like code.

Final Recommendation

For Claude Code and Cursor teams, I would start with this operating model:

  1. Write a compact repo context file first.
  2. Add skills only for repeated workflows that the team can name.
  3. Keep each skill small enough that the agent can actually follow it.
  4. Prefer skills with scripts, examples, or checklists over vague advice.
  5. Run a tiny A/B eval before standardizing a skill across the team.

The best teams will not ask “should we use skills?” They will ask:

Which part of our agent workflow needs persistent context, which part needs a reusable procedure, and which part needs an executable check?

That is where skills become useful: not as decoration, but as measured workflow infrastructure.

Source List