How to evaluate an AI skill
You installed a skill. The README looks fine. The demo in the docs worked on the first try. Should you keep it?
You can’t tell from the README. You have to run the skill on your own work and look at what comes out.
This post walks through doing that. We start with the simplest possible version: one skill, one prompt, you read the output. Then we slowly add structure until you have ten prompts, a scoring sheet, and a folder you can rerun the next time the skill or the model changes.
You do not need an evals framework. You need claude, a text editor, and a small amount of patience.
The examples in this post use the claude CLI and the Claude Agent SDK because that’s what I happen to use. The shape of the process is the same with any agent. Swap claude for codex, gemini, or whatever else, swap the SDK call in the rubric judge, and the rest of the post works unchanged. The skill format, the failure modes, the prompts, the scoring, and the folder layout are all agent-agnostic.
Table of contents
- What a skill is, in two sentences
- The two ways a skill fails
- Skills that need a workspace, and skills that don’t
- Step 1: Run it once, by hand
- Step 2: Name the one job you actually care about
- Step 3: Write a handful of prompts
- Step 4: Run them, still by hand
- Step 5: Add structure, one layer at a time
- Step 6: Read the bad runs
- Step 7: Decide
- A worked mini-example:
humanizer - What to skip
- What to keep
What a skill is, in two sentences
A skill is a folder with a SKILL.md file. The file has a name, a description, and instructions, and the model loads it into context when the user’s request looks like the description.
Anthropic skills, OpenAI’s Codex skills, and most home-grown variants all work this way. If you can read a markdown file, you can read a skill.
The two ways a skill fails
Before we run anything, name the failure modes. They have nothing to do with each other and are usually fixed in different files.
- It does not fire when it should. The user asked for the right thing and the skill stayed silent. The model used general knowledge instead.
- It fires but the output is wrong. The skill was invoked, ran its instructions, and still produced something off-spec.
Hold these as separate scoreboards from now on. A skill with 100% trigger and 60% execution is not the same problem as one with 60% trigger and 100% execution. Conflating them is how people end up tuning a skill in the wrong direction for two weeks.
Skills that need a workspace, and skills that don’t
This is the part that confuses people on their first eval, and it changes how you set everything else up.
Some skills are pure transforms on text. You give them a paragraph in the prompt, you get a rewritten paragraph back. They don’t read files, they don’t run commands, they don’t care what directory claude is launched from. The humanizer skill is like this.
Other skills assume there’s a project to work on. They want to read files, run git log, look at a package.json, and write output into the directory. The release-notes skill is like this. Pointing it at an empty directory will produce nothing useful.
Before you write a single prompt, decide which kind of skill you have:
| Kind | What it needs on disk | How you point claude at it | Example |
|---|---|---|---|
| Text-in / text-out | Nothing. The input is in the prompt. | Run claude from anywhere, paste the input. | humanizer, sql-explainer |
| Workspace-required | A directory with files, possibly a git repo. | Run claude inside the directory, or pass --add-dir to a fixture copy. | release-notes, migration-writer, pr-review |
| Workspace-optional | Works either way. Reads files when there are some, asks for input when there aren’t. | Try both modes. They often have different failure shapes. | code-reviewer, doc-writer |
Confusing a workspace-required skill for a text-in skill is the single most common reason someone says “the skill never fires” on day one. The skill fired. It looked around, found nothing, and gave up.
For the rest of this post we will run two examples in parallel:
humanizer, the text-in case.release-notes, the workspace-required case.
Watch for the workspace-optional case in your own work. The eval shape is the same as the workspace-required one, but you should write at least two prompts that don’t supply a workspace and confirm the skill behaves sensibly when there’s nothing to read.
Step 1: Run it once, by hand
Skip the spreadsheet. Skip the scripts. The first thing you should do is run the skill once and look at the output.
For a text-in skill (humanizer)
Open a terminal anywhere. Pick a paragraph that you have actually written and that you think is too AI-flavored. Then:
claude --print "Use the humanizer skill to rewrite this paragraph: <paste>"
--print (or -p) is non-interactive mode. The agent runs, does its thing, and exits. You will see the rewritten paragraph in your terminal.
Now ask yourself, with no scoring rubric:
- Did the rewrite happen, or did the model just say “okay” and stop?
- Is the meaning preserved?
- Is the new version actually less AI-sounding to your ear?
That’s the eval. One prompt, three questions, two minutes. You already know more than you did before you ran it.
For a workspace-required skill (release-notes)
Pick a real repo (a side project works fine), cd into it, and run:
claude --print "Use the release-notes skill to write notes for the last 30 commits."
You should see a release notes file appear in the directory, or a markdown blob in the terminal. If it complains about permissions, add --permission-mode bypassPermissions while you’re evaluating. Don’t use that flag for real work.
Same three questions:
- Did the skill fire?
- Did it actually look at the git log? (Skim the transcript, or just check whether the bullets reflect real commits.)
- Is the output something you would ship?
By now you’ll have a hunch about whether this skill is worth keeping. The rest of the post is about turning a hunch into a repeatable answer.
Step 2: Name the one job you actually care about
A skill can do many things. You should evaluate one.
Write a single sentence in the form “given X, produce Y.” If you can’t finish that sentence in one breath, the skill is too vague to evaluate yet, and that itself is useful information.
Some examples:
| Skill | The one-sentence job |
|---|---|
humanizer | Given a paragraph of writing, rewrite it to remove AI patterns while preserving meaning. |
release-notes | Given a git repo and two refs, produce a markdown file with bullets grouped by Features, Fixes, Docs, and Breaking Changes. |
sql-explainer | Given a SQL query, produce a plain-English explanation of what it returns. |
pr-review | Given a git diff, produce a markdown review with sections for Summary, Concerns, and Suggestions. |
Notice that none of these are “help me with X.” Every one is a transform: input shape on the left, output shape on the right. That structure is what makes the skill gradable in the first place.
Step 3: Write a handful of prompts
Now expand from one prompt to about ten. Ten is enough to catch real problems. Fewer than that and you will miss things; more than that on day one and you will not run them.
Mix four kinds:
| Kind | What it tests | How many |
|---|---|---|
| Explicit | User names the skill or its job exactly. The skill must fire. | 3 |
| Implicit | User describes the job in their own words, no skill name. The skill should fire. | 3 |
| Adjacent | Sounds related but the skill must not fire. Catches false positives. | 2 |
| Edge | Empty input, broken context, ambiguous goal. Catches silent failures. | 2 |
For humanizer, a starter set:
id,should_trigger,prompt
01,true,"Use the humanizer skill to rewrite this paragraph: <AI-ish text>"
02,true,"Edit this paragraph to sound less like AI: <AI-ish text>"
03,true,"Rewrite to remove AI patterns: <AI-ish text>"
04,true,"Make this less generative-AI-flavored: <AI-ish text>"
05,true,"Help this read like a human wrote it: <AI-ish text>"
06,true,"Strip the corporate-blog tone from this: <AI-ish text>"
07,false,"Translate this paragraph to Spanish: <AI-ish text>"
08,false,"Make this paragraph more concise: <AI-ish text>"
09,true,"Rewrite this <AI-ish text>"
10,true,"Humanize."
Rows 7 and 8 are the negative controls. Row 7 is a translation request; row 8 is a generic edit. Both must stay quiet, or the skill is firing too eagerly. Row 8 is the interesting one for humanizer: an old version of this skill fired on any “edit this” request because the description said it triggers on “editing or reviewing.” Tightening the description fixed it.
For release-notes, the equivalent set:
id,should_trigger,prompt
01,true,"Use the release-notes skill to write notes from v1.1 to HEAD."
02,true,"Generate release notes for v2.0..HEAD."
03,true,"Run release-notes against the last 50 commits."
04,true,"Make me a CHANGELOG entry covering everything since v3."
05,true,"Summarize what shipped between the last release and now in markdown, grouped by category."
06,true,"I need release notes for the v0.9 to v1.0 jump."
07,false,"What were the last 5 commits about? Just a quick summary."
08,false,"Open CHANGELOG.md and let me dictate the bullets. Don't generate them."
09,true,"Generate release notes. Use HEAD~30..HEAD if no range is specified."
10,true,"Make release notes."
Row 10 is the cheeky one: no range, no context. Did your description tell the skill to ask for a range when none is given? If not, the skill might invent one or fail silently. Both outcomes are fine, but you should know which you get before you ship.
Step 4: Run them, still by hand
You don’t need the runner script yet. Open ten terminals if you want. Or run them one at a time and paste the output into a notes file. The point of doing it by hand the first time is that you’ll notice things a script will hide.
For each run, write down two things in a notebook or markdown file:
- Triggered? Yes or no. (You can usually see this in the transcript: the model mentions the skill by name when it fires. If you’re not sure, ask “Did you use a skill to answer this?” in a follow-up.)
- Good output? Yes or no, with a one-line note on what was wrong.
That’s a complete first-pass scorecard. Two columns, ten rows. Five minutes to fill out, and you’ll already know if the skill is worth keeping.
If both columns are mostly Yes, congratulations. The skill is doing its job and you can stop here for now. Bookmark this post for the next time you change skills or models.
If either column is mostly No, the rest of this post is about turning that observation into specific edits.
Step 5: Add structure, one layer at a time
You’ve reached the point where doing this by hand is annoying. Now we automate, in three layers, in this order: the runner, the deterministic checks, then the rubric judge. Don’t skip ahead.
5a. The runner
The goal: run all ten prompts, save each transcript, and (if it’s a workspace skill) give each run its own clean directory.
Here is the workspace version, for release-notes:
#!/usr/bin/env bash
set -euo pipefail
RUN_DIR="runs/$(date +%Y-%m-%d_%H%M)"
mkdir -p "$RUN_DIR"
while IFS=, read -r id should_trigger prompt; do
[ "$id" = "id" ] && continue
case_dir="$RUN_DIR/$id"
mkdir -p "$case_dir/workspace"
cp -R fixtures/sample-repo/. "$case_dir/workspace/"
claude \
--print \
--output-format json \
--permission-mode bypassPermissions \
--add-dir "$case_dir/workspace" \
"$prompt" \
> "$case_dir/transcript.json"
echo "$id,$should_trigger" >> "$RUN_DIR/manifest.csv"
done < prompts.csv
And the simpler version for a text-only skill like humanizer. No fixture, no --add-dir, no per-case workspace:
#!/usr/bin/env bash
set -euo pipefail
RUN_DIR="runs/$(date +%Y-%m-%d_%H%M)"
mkdir -p "$RUN_DIR"
while IFS=, read -r id should_trigger prompt; do
[ "$id" = "id" ] && continue
case_dir="$RUN_DIR/$id"
mkdir -p "$case_dir"
claude \
--print \
--output-format json \
"$prompt" \
> "$case_dir/transcript.json"
echo "$id,$should_trigger" >> "$RUN_DIR/manifest.csv"
done < prompts.csv
Things worth knowing about the flags:
--printis non-interactive mode. Without it,claudeopens a chat.--output-format jsongives you a structured object with the full message history, tool calls, and token counts. Every later check grades against this object.--permission-mode bypassPermissionsstops the agent from pausing on every file write. Use it for evals, not for real work.--add-dirputs the case workspace inside the agent’s allowed paths. Each case gets its own directory so they don’t contaminate each other. You only need this when the skill writes to disk.
Run the script. Grab a coffee. You’ll have ten transcripts in about as many minutes.
5b. Deterministic checks
These don’t need a model. They answer cheap, reliable questions like:
- Did the skill name appear in the transcript? (Did it fire?)
- Did the agent run
git log, or whatever the skill is supposed to call? - Does the output file exist?
- Does it contain the expected section headings?
A small Python checker for release-notes:
import json, sys, pathlib
case_dir = pathlib.Path(sys.argv[1])
transcript = json.loads((case_dir / "transcript.json").read_text())
workspace = case_dir / "workspace"
def used_skill(transcript, name):
return name in json.dumps(transcript)
def ran_command(transcript, needle):
for msg in transcript.get("messages", []):
for block in msg.get("content", []):
if block.get("type") == "tool_use" and block.get("name") == "Bash":
if needle in block.get("input", {}).get("command", ""):
return True
return False
def output_file():
for path in workspace.glob("**/*.md"):
if "release" in path.name.lower() or "changelog" in path.name.lower():
return path
return None
out = output_file()
text = out.read_text() if out else ""
checks = {
"skill_invoked": used_skill(transcript, "release-notes"),
"ran_git_log": ran_command(transcript, "git log"),
"produced_markdown_file": out is not None,
"has_features_heading": "## Features" in text or "### Features" in text,
"has_fixes_heading": "## Fixes" in text or "### Fixes" in text,
"no_merge_commits_in_output": "Merge pull request" not in text,
}
for k, v in checks.items():
print(f"{'PASS' if v else 'FAIL'} {k}")
Six checks, each yes/no. Run it across all ten cases and you have the spine of a scorecard.
For a text-in skill like humanizer, the deterministic checks look at the response text directly instead of files on disk. The transcript from --output-format json includes a result field with the final assistant text:
import json, sys, pathlib
case_dir = pathlib.Path(sys.argv[1])
transcript = json.loads((case_dir / "transcript.json").read_text())
text = transcript.get("result", "")
checks = {
"skill_invoked": "humanizer" in json.dumps(transcript),
"no_em_dashes": "\u2014" not in text,
"no_tapestry": "tapestry" not in text.lower(),
"no_vibrant": "vibrant" not in text.lower(),
"no_underscore_verb": "underscore" not in text.lower(),
}
for k, v in checks.items():
print(f"{'PASS' if v else 'FAIL'} {k}")
Same shape, no workspace involved. The check operates on the transcript instead of on files.
This is also where you catch things you didn’t think to test. The first time you run this, you’ll probably discover that one case “passed” because a leftover CHANGELOG.md was sitting in the fixture from a previous experiment. That’s why every workspace case gets its own clean directory.
5c. A rubric judge using the Agent SDK
Some things are not file checks. “Are bullets grouped under the right headings?” or “Did the rewrite preserve the paragraph’s main claim?” need a reader. Claude is a fine reader if you give it a short rubric and ask for a structured answer.
For release-notes the judge needs the workspace because the artifact is a file:
import asyncio, pathlib, sys
from claude_agent_sdk import query, ClaudeAgentOptions
case_dir = pathlib.Path(sys.argv[1])
workspace = case_dir / "workspace"
rubric = """
Grade the release-notes markdown file in the working directory.
1. correct_grouping: Each bullet is filed under the right heading.
2. user_facing_language: Bullets describe outcomes, not raw commit subjects.
3. trivial_commits_excluded: WIP, typo, lint, formatting, and merge commits are absent.
4. breaking_changes_surfaced: BREAKING CHANGE commits appear under a Breaking Changes section.
Reply with one JSON object. No prose.
{
"overall_pass": <bool>,
"criteria": {
"correct_grouping": {"pass": <bool>, "notes": "<one sentence>"},
"user_facing_language": {"pass": <bool>, "notes": "<one sentence>"},
"trivial_commits_excluded": {"pass": <bool>, "notes": "<one sentence>"},
"breaking_changes_surfaced": {"pass": <bool>, "notes": "<one sentence>"}
}
}
"""
async def grade():
options = ClaudeAgentOptions(
cwd=str(workspace),
permission_mode="bypassPermissions",
allowed_tools=["Read", "Glob", "Grep"],
)
final = ""
async for msg in query(prompt=rubric, options=options):
for block in getattr(msg, "content", []) or []:
if getattr(block, "type", "") == "text":
final = block.text
print(final)
asyncio.run(grade())
For humanizer the judge doesn’t need any tools because the artifact lives in the transcript. Pull the rewritten text out and pass it inline:
import asyncio, csv, json, pathlib, sys
from claude_agent_sdk import query, ClaudeAgentOptions
case_dir = pathlib.Path(sys.argv[1])
case_id = case_dir.name
transcript = json.loads((case_dir / "transcript.json").read_text())
prompts = {row["id"]: row["prompt"] for row in csv.DictReader(open("prompts.csv"))}
original_prompt = prompts[case_id]
rewritten = transcript.get("result", "")
rubric = f"""
Grade this rewrite.
1. preserves_meaning: The main claim of the original is still present.
2. less_ai_flavored: Avoids "tapestry," "vibrant," "underscore," em dashes, rule-of-three.
3. natural_voice: Reads like something a human would say out loud.
ORIGINAL PROMPT:
{original_prompt}
AGENT'S REWRITE:
{rewritten}
Reply with one JSON object. No prose.
{{
"overall_pass": <bool>,
"criteria": {{
"preserves_meaning": {{"pass": <bool>, "notes": "<one sentence>"}},
"less_ai_flavored": {{"pass": <bool>, "notes": "<one sentence>"}},
"natural_voice": {{"pass": <bool>, "notes": "<one sentence>"}}
}}
}}
"""
async def grade():
options = ClaudeAgentOptions(allowed_tools=[])
final = ""
async for msg in query(prompt=rubric, options=options):
for block in getattr(msg, "content", []) or []:
if getattr(block, "type", "") == "text":
final = block.text
print(final)
asyncio.run(grade())
Two things to call out in both versions:
- The judge is read-only. For workspace skills, restrict it to
Read,Glob, andGrep. It must not be able to edit the artifact it’s grading. For text-only skills, give it no tools at all. - Structured output, not numeric. The judge returns a JSON object with one boolean per criterion. Don’t ask for a 1-to-100 score. You’ll spend a week trying to figure out why the score moved from 78 to 73, and the answer will always be “vibes.”
Run the grader against each case. Combine its output with the deterministic checks. You now have a scorecard per case.
Step 6: Read the bad runs
This is the part that shouldn’t be skipped.
Open the transcripts where things failed and read the actual messages. You’re looking for things like:
- The skill description matched the wrong prompt. Fix: tighten the description.
- The skill fired but the agent ignored a step. Fix: make the step less skippable in the SKILL.md.
- The skill fired and did the wrong thing because the user gave ambiguous input. Fix: add an “if the user does not specify X, ask first” rule.
- The skill never fired and you can’t tell why. Fix: read the description out loud and ask whether a stranger would match the user’s prompt to it.
Langfuse’s writeup on the same loop has a good story about this. They found their skill stopped firing at all when they made the description more “abstract” because the model could no longer match keywords. They reverted, and the regression went away. You only catch that kind of thing by reading the trace, not by staring at the score.
Step 7: Decide
You have a scorecard. Now do the boring grown-up part:
- If trigger accuracy is bad, edit the description in
SKILL.md. Re-run. Compare. - If execution is bad, edit the body of
SKILL.md. Re-run. Compare. - If neither is the problem and the skill is just slow or expensive, look at the token counts in the transcripts. Trim the body. Re-run. Compare.
The whole point of having a fixture is that “compare” is a real verb. You ran the same prompts before, you ran the same prompts after, and the numbers either moved or they didn’t.
If after two or three iterations the skill is still under-performing on your work, that is also a real answer. Uninstall it. Pick a different one. Or write a smaller skill that does only the part you need. None of those are losses; they’re the eval doing its job.
A worked mini-example: humanizer
To make this concrete, here’s one I actually ran. The skill is humanizer: a SKILL.md that detects AI writing patterns (em dash overuse, “tapestry,” promotional puffery, that sort of thing) and rewrites them.
I started with three prompts (the by-hand version of Step 4):
- “Edit this paragraph to sound less like AI: [obviously AI-ish paragraph].” Should fire.
- “Translate this paragraph to Spanish.” Should not fire.
- “Make this paragraph more concise.” Borderline. I wanted to see what happened.
Five deterministic checks:
- Did the skill fire on case 1? Yes.
- Did it stay silent on case 2? Yes, the model just translated.
- On case 1, did the output still contain “tapestry,” “vibrant,” or “underscore”? No.
- On case 1, did the output contain any em dashes? No.
- On case 1, did the output preserve the main claim? Judged by Claude with a small rubric. Yes.
Case 3 was the interesting one. The skill fired and tried to humanize the paragraph when the user only asked for “more concise.” Reading the transcript, the description in SKILL.md said the skill triggered on “editing or reviewing” any text. That was too broad. I tightened it to “use when the user explicitly asks to remove AI-sounding language or to make text sound more human,” and the skill stopped firing on case 3. Trigger accuracy went from 2/3 to 3/3. Total time, including coffee: about an hour.
That is what a real eval looks like. Small, specific, useful.
What to skip
In the order I see people get nerd-sniped:
- Big benchmark suites. You don’t need them. Ten of your prompts beat a thousand of someone else’s.
- Single-number scores. “It got 87.” 87 of what? 87 from when?
- Eval frameworks before you have ten prompts. The framework is a distraction from writing the prompts. Write the prompts first.
- Grading every nuance. If you can’t turn it into a yes/no question, leave it out of the rubric.
- Eval-driven prompt engineering. Tuning the skill so the fixture passes is just Goodharting yourself. The fixture is a sample of your work, not a target.
What to keep
The whole eval lives in one folder per skill. Here’s the layout I use, with one case filled in so nothing is left to imagination:
evals/
└── release-notes/ # one folder per skill under eval
├── prompts.csv # the 10 prompts from Step 3
├── rubric.md # plain-text version of the judge rubric
├── history.md # one line per run, newest at the top
├── fixtures/ # only present for workspace skills
│ └── sample-repo/
│ └── .git/...
├── scripts/
│ ├── run.sh # the runner from Step 5a
│ ├── check_deterministic.py # the checker from Step 5b
│ └── check_rubric.py # the SDK judge from Step 5c
└── runs/
├── 2026-05-06_1430_baseline/
│ ├── manifest.txt # date, model, claude version, skill version, input SHA
│ ├── summary.md # one-page writeup of what changed
│ ├── scores.csv # one row per case, one column per check
│ └── cases/
│ ├── 01/
│ │ ├── transcript.json
│ │ ├── deterministic.txt
│ │ ├── rubric.json
│ │ ├── input.txt # commit SHA + branch link, or path to fixture
│ │ └── artifacts/ # only the files the skill produced or modified
│ │ └── RELEASE_NOTES.md
│ └── ...10/
└── 2026-05-06_1700_v2-tighter-description/
└── ... # same shape, different run
For a text-only skill the layout is the same minus the fixtures/ directory and the per-case artifacts/ folder (the artifact for a text-only skill is just the final string in transcript.json). Keep everything else.
A few rules that make this layout work in practice:
- One folder per run, named
YYYY-MM-DD_HHMM_label. The label is a short slug describing what changed since the last run (baseline,v2-tighter-description,gpt-5.5-instead). When you compare two runs three months from now, the label is what you’ll read first. manifest.txtis filled out before the run, not after. It records the model, theclaude --versionoutput, the skill version (commit SHA if it lives in git), and which other skills were enabled. Without this you cannot tell whether a regression came from the skill, the model, or your environment.- Keep the artifacts, not the workspace. Save only the files the skill produced or modified into
cases/<id>/artifacts/. These are small and they’re what you actually grade. Do not commitnode_modules, build output, LFS blobs, or the unmodified rest of the project. - Reference the input state, don’t snapshot it. What you do need is a way to recreate the input the skill saw. How you record it depends on which kind of fixture this is:
- Synthetic fixture repo (small, lives in
fixtures/sample-repo/): record the path and the commit SHA inside that fixture. The fixture is already in the eval repo, so the SHA is enough. - Real repo (a side project, a work repo, anything you wouldn’t want to duplicate): tag the input state in that repo with a branch named
eval/<skill>/<run-label>/<case-id>and write the repo URL plus that branch intoinput.txt. Pushing the branch is what makes the state durable; without it, normal branch deletion or force-pushes can erase what you graded against. - Either way, also record the commit SHA in
manifest.txtso you can find the state even if the branch is later deleted.
- Synthetic fixture repo (small, lives in
- A working
input.txtfor a real repo:repo: https://github.com/me/myproject branch: eval/release-notes/2026-05-06_1430/case-01 sha: 7c3f2a1 - The deal you’re making with this layout: the artifact is reproducible from
input.txtplus the recorded model and skill versions. Ifinput.txtever points at a state you can no longer fetch, the case is no longer reproducible. That is sometimes acceptable (the grade still happened, the artifact is still in the folder), but it is the failure mode to design around. For anything you genuinely want to be able to rerun in a year, use a synthetic fixture; reserve real-repo branches for cases where reproducibility is nice-to-have, not load-bearing. scores.csvis the same shape across every run. One row per case, one column per check, plustriggeredandexecutedcolumns. Same columns means you candifftwo runs directly.summary.mdis one page, written by you. Three sections: what changed, what the numbers say, what surprised you. If you can’t summarize a run on one page, you ran too many cases or you aren’t reading the transcripts.history.mdis the index across runs, written by you, lazy. One line per run, newest at the top. Each line: date, label, score delta from the previous run, one sentence on what changed, link to the run’ssummary.md. This is the only file in the layout that lets you spot trends across runs without opening each folder. Only add an entry when the run taught you something. A run that confirmed nothing changed gets a one-line entry or no entry at all. The momenthistory.mdstarts growing into paragraphs, that content belongs in a run’s ownsummary.mdand you should move it.
A working history.md looks like this:
# release-notes eval history
## 2026-05-06 v2-tighter-description (trigger 10/10, exec 8/10, +2 trigger)
Tightened SKILL.md description to require an explicit ref range. Trigger false positive on case 8 went away. [details](runs/2026-05-06_1700_v2-tighter-description/summary.md)
## 2026-05-06 baseline (trigger 8/10, exec 8/10)
First run. Skill fires too eagerly on adjacent prompts (case 7, 8). Output quality is fine when it does fire. [details](runs/2026-05-06_1430_baseline/summary.md)
Everything else (graphs, dashboards, CI integration) is optional and arrives later, when you’ve actually used this for two or three iterations and miss something specific.
That’s enough infrastructure to evaluate any skill, indefinitely, without ever opening a tool with the word “platform” in its name. You won’t get more accurate by adding tooling. You’ll get more accurate by writing better prompts and reading more transcripts.
The whole eval is just: did this skill make my work better, on my work, today? Everything in this post is an answer to that one question.