Claude Code Skills 2.0: Built-In Evaluation and Testing

Before Skills 2.0, improving a skill meant running it, reading the output, guessing what to fix, and trying again. There was no structured way to know whether a change actually helped. Skills 2.0 adds built-in evaluation – you define criteria, run parallel tests, and get scored results that tell you exactly where your skill succeeds and where it falls short.

What Evals Actually Do

An eval runs your skill multiple times against specific criteria and returns a graded report. Not pass/fail – scored commentary on each criterion you define. The more specific your criteria, the more useful the results.

Vague: “Test my writing skill”

Specific: “Test whether my writing skill consistently uses social proof, opens with a pain point, and includes a clear call to action based on the frameworks in my copywriting-toolkit.md reference file”

The difference between those two prompts is the difference between generic feedback and actionable insight.

The Workflow

Step 1: Build the skill. Use Claude Code’s skill creator or write skill.md directly. Define the skill name, trigger description, goal, required tools, reference files, and step-by-step process.
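A minimal skill.md following that structure might look like the sketch below. The frontmatter fields and section names are an assumption based on the list in this step, not a canonical schema; the process steps mirror the copywriting example used later in this post:

```markdown
---
name: copywriting
description: Writes landing page and email copy. Trigger when the user
  asks for marketing copy, landing pages, or launch emails.
---

## Goal
Produce persuasive copy that follows the frameworks in the reference files.

## Reference files
- persuasion-toolkit.md
- tone-guide.md

## Process
1. Read persuasion-toolkit.md before drafting.
2. Open with a concrete pain point.
3. Sustain a curiosity gap over multiple sentences.
4. Include social proof or a founder story.
5. End with a specific call to action.
```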

Step 2: Define evaluation criteria. Be explicit about what matters. Each criterion should be independently testable.

Step 3: Run the eval. Claude Code launches multiple sub-agents in parallel, runs your skill against a concrete task, and returns a structured HTML report with grades per criterion.

Step 4: Iterate. Read the report, adjust skill.md, re-run. Repeat until you consistently hit your target score.

Use Cases

1. Marketing Copy Quality Control

You have a skill that writes landing page copy using a persuasion toolkit reference file. But you are not sure whether the skill actually follows your frameworks or just produces generic marketing text.

Eval criteria:

- Uses the persuasion-toolkit.md reference file
- Opens with a concrete pain point
- Creates curiosity gaps sustained over multiple sentences
- Includes social proof or a founder story
- Ends with a specific CTA

Run it:

Run 5 tests on my copywriting skill. Task: write landing page copy
for a B2B AI governance platform targeting CTOs. Grade each output on
these criteria:
1. Uses persuasion-toolkit.md reference file
2. Opens with concrete pain point
3. Creates curiosity gaps sustained over multiple sentences
4. Includes social proof or founder story
5. Ends with specific CTA

What you learn: The report might show that your skill nails pain points (5/5) but only uses curiosity gaps in 2 out of 5 runs. Now you know exactly what to emphasize in skill.md – add explicit instructions about sustaining information gaps.
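One way to act on that finding is to make the instruction explicit in the skill's process steps. A sketch of the added line (the wording is illustrative):

```markdown
## Process (excerpt)
3. Sustain a curiosity gap over multiple sentences: open an information
   gap early and do not resolve it until at least two sentences later.
```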

2. Code Review Consistency

You built a skill that reviews pull requests and flags issues. It works, but you suspect it misses certain categories of problems or produces inconsistent feedback depending on the code it reviews.

Eval criteria:

- Catches security vulnerabilities
- Flags missing error handling
- Identifies performance concerns
- Notes test coverage gaps
- Provides actionable suggestions with line references

Run it:

Run 5 tests on my code-review skill. For each test, use a different
sample PR from our repo (mix of Elixir backend, JavaScript frontend,
and database migrations). Grade on:
1. Catches security vulnerabilities
2. Flags missing error handling
3. Identifies performance concerns
4. Notes test coverage gaps
5. Provides actionable suggestions with line references

What you learn: Maybe the skill catches security issues 4/5 times but only flags performance concerns 1/5. You can then update skill.md to include an explicit performance checklist: “Always check for N+1 queries in Ecto, unnecessary Enum.map chains, and missing database indexes.”
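That checklist can live in skill.md as an explicit review step. The section name below is illustrative; the checks are the ones quoted above:

```markdown
## Review checklist (excerpt)
- Performance: always check for N+1 queries in Ecto, unnecessary
  Enum.map chains, and missing database indexes.
```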

3. Customer Support Response Quality

You have a skill that drafts responses to customer support tickets. It needs to be empathetic, technically accurate, and follow your company’s tone guidelines. Getting one of those right is easy. Getting all three consistently is the hard part.

Eval criteria:

- Leads with empathy
- Technically accurate
- Matches the company tone guide
- Gives a concrete next step
- Avoids over-promising

Run it:

Run 5 tests on my support-response skill. Use these scenarios:
1. Customer locked out of their account for 3 days
2. Customer reporting data loss after an update
3. Customer angry about a billing error
4. Customer confused by a new feature
5. Customer requesting a feature that doesn't exist
Grade each on: empathy first, technical accuracy, tone guide match,
concrete next step, avoids over-promising

What you learn: The report might reveal the skill handles angry customers well (empathy scores high) but struggles with “feature doesn’t exist” scenarios – it over-promises or hints that the feature might be coming. You fix this by adding guardrails in skill.md: “When a requested feature does not exist, acknowledge the need, explain current alternatives, and direct to the feature request channel. Never imply the feature is planned unless confirmed.”

4. A/B Testing Skills Against Each Other

Beyond evaluating a single skill, you can compare two versions or test with vs. without a skill entirely.

Questions A/B testing answers:

- Does the skill actually improve output compared to running without it?
- Which of two skill versions produces better results?
- Is the quality gain worth the extra tokens and runtime?

A/B test my copywriting skill: Run 5 tests WITH the skill and 5
tests WITHOUT. Same task: write email copy for a product launch.
Compare on: persuasion technique usage, emotional resonance,
specificity of claims. Also report token usage and runtime per run.

The eval report includes benchmarking – per-run duration and token counts alongside quality grades. If the full skill scores 8/10 but uses 3x the tokens of a leaner version that scores 7/10, you have a real decision to make.
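That decision can be framed with quick arithmetic. The sketch below uses the hypothetical figures from the paragraph above (scores of 8/10 and 7/10, a 3x token difference); the numbers and the quality-per-token metric are illustrative, not output from a real eval report:

```python
# Hypothetical A/B figures: the full skill scores higher but costs
# roughly 3x the tokens of the leaner version.
variants = {
    "full skill": {"score": 8, "avg_tokens": 30_000},
    "lean skill": {"score": 7, "avg_tokens": 10_000},
}

for name, v in variants.items():
    # Quality points per 10k tokens: one crude way to frame the tradeoff.
    per_10k = v["score"] / (v["avg_tokens"] / 10_000)
    print(f"{name}: score {v['score']}/10, {per_10k:.2f} points per 10k tokens")
```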

Reading the Report

The HTML evaluation report shows:

- A grade and commentary for each criterion, per run
- An overall score across all criteria
- Per-run duration and token counts for benchmarking

A score of 6/12 (50%) is not a failure; it is a baseline. It tells you exactly which six criteria need work. Iterate on skill.md, re-run, and aim for 90% or better. Once you consistently hit that, the skill is production-ready.

Tips for Better Evals

Be specific about the task. “Write some copy” gives inconsistent results. “Write landing page copy for a B2B AI governance platform targeting CTOs at companies with 50-200 employees” gives something you can actually grade.

Tie criteria to reference files. If you have a tone-guide.md or persuasion-toolkit.md, reference it in your criteria. This tests whether the skill actually uses the files, not just whether the output sounds vaguely right.

Run enough instances. Five is a reasonable minimum. One run can be an outlier. Five runs reveal patterns.

Grade what matters, not everything. Three to six criteria is the sweet spot. More than that and the eval becomes noisy. Pick the criteria that separate “good enough” from “actually good.”

Iterate one thing at a time. Change one part of skill.md, re-run the eval, compare scores. If you change five things at once, you will not know which change helped.