AI coding agents are fast, but they cut corners. Agent Skills is an open-source project by Addy Osmani that gives agents the same structured workflows senior engineers follow, from spec to ship. This post breaks down how it works, explains the Google engineering principles it builds on (Hyrum’s Law, Chesterton’s Fence, the Beyonce Rule, Shift Left), and offers a balanced take on where it helps and where it falls short.

AI coding agents are fast. Scarily fast. Give one a feature request and it’ll produce hundreds of lines in seconds. But speed without discipline creates a specific kind of mess: code that works in the demo, passes the happy path, and falls apart the moment a real user touches it.

The problem isn’t that agents can’t write good code. It’s that they default to the shortest path. They skip specs because nobody told them to write one. They skip tests because the function already runs. They skip security reviews because, well, the prompt didn’t mention security. They optimize for “done” over “correct,” and if you’ve ever had to maintain a codebase an agent built without guardrails, you know exactly how that feels.

Agent Skills, an open-source project by Addy Osmani (Director of Engineering at Google), tries to fix this. It’s a collection of 20 engineering skills and 7 slash commands that encode the workflows senior engineers actually follow when building production software. Think of it as a curriculum for your AI agent, one that teaches it when to write a spec, how to break down tasks, what to test, how to review code, and when to ship.

The project has picked up over 14,000 stars on GitHub and works across Claude Code, Cursor, Gemini CLI, GitHub Copilot, and any agent that reads Markdown. That last part matters: skills are just .md files. No SDK, no runtime dependency. You can fork it tomorrow and owe nobody anything.

The core idea: process over prose

Most attempts to improve AI coding output start with better prompts. “Write clean code.” “Follow best practices.” “Make sure it’s secure.” These instructions are too vague to be actionable, and agents treat them accordingly, the same way a junior developer might nod along during a code review without changing anything.

Agent Skills takes a different approach. Instead of telling the agent what good code looks like, it tells the agent how to get there. Each skill is a structured workflow with concrete steps, checkpoints, and exit criteria. The agent doesn’t just know it should test its code; it follows a specific test-driven development process (Red-Green-Refactor) with a defined test pyramid ratio (80% unit, 15% integration, 5% end-to-end).

This works for a straightforward reason: LLMs are good at following instructions. Really good at it. What they’re bad at is deciding which instructions to follow when nobody gives them any. Agent Skills fills that gap by being specific where most prompts are vague.

Seven commands, one lifecycle

The project maps a standard software development lifecycle onto seven slash commands. Six follow the phases in the diagram below; the seventh, /code-simplify, belongs to the review phase:

  DEFINE          PLAN           BUILD          VERIFY         REVIEW          SHIP
 ┌──────┐      ┌──────┐      ┌──────┐      ┌──────┐      ┌──────┐      ┌──────┐
 │ Idea │ ───▶ │ Spec │ ───▶ │ Code │ ───▶ │ Test │ ───▶ │  QA  │ ───▶ │  Go  │
 │Refine│      │  PRD │      │ Impl │      │Debug │      │ Gate │      │ Live │
 └──────┘      └──────┘      └──────┘      └──────┘      └──────┘      └──────┘
  /spec          /plan          /build        /test         /review       /ship
  • /spec forces the agent to write a Product Requirements Document before touching any code.
  • /plan breaks that spec into small, atomic tasks with acceptance criteria.
  • /build implements those tasks one thin vertical slice at a time.
  • /test runs Red-Green-Refactor with the test pyramid.
  • /review applies a five-axis code review modeled on Google’s review practices.
  • /code-simplify reduces complexity without changing behavior.
  • /ship runs pre-launch checklists, configures rollbacks, and sets up monitoring.

Each command activates the relevant skills underneath. You don’t need to remember which skill does what. The commands are the entry points, and the skills are the machinery.

What makes the skill format work (and where it gets interesting)

Every skill follows the same anatomy: an overview, triggering conditions, a step-by-step process, a rationalizations table, red flags, and verification requirements. The rationalizations table is the most unusual part. It lists the excuses an agent might use to skip a step, paired with counter-arguments.

For example, the testing skill includes a rationalization like “I’ll add tests later” with the rebuttal that tests written after implementation tend to test the code’s behavior rather than its requirements, creating brittle tests that pass today and break on the first refactor. This is a real problem with AI agents. They’ll rationalize their way around guardrails if you let them, and having pre-written counter-arguments built into the skill means the agent has already “read” the rebuttal before it can talk itself out of doing the work.

Does this always work? No. Agents can still find creative ways to shortcut things, particularly when context windows get tight or when conflicting instructions appear elsewhere in the session. But it works often enough that I keep using it, and that’s the bar for a tool like this.
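To make that anatomy concrete, here is a minimal Python sketch that models the six parts as a data structure and flags any that are empty. The class and field names are my own assumptions for illustration; the real skills are Markdown files, and their exact section headings may differ.

```python
from dataclasses import dataclass, fields


@dataclass
class Skill:
    """Hypothetical in-memory model of one skill's six parts (not the project's format)."""
    overview: str
    triggers: list[str]
    steps: list[str]
    rationalizations: dict[str, str]  # excuse -> rebuttal
    red_flags: list[str]
    verification: list[str]

    def missing_parts(self) -> list[str]:
        """Names of any parts left empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


testing = Skill(
    overview="Write tests before the implementation.",
    triggers=["new feature", "bug fix"],
    steps=["write a failing test", "make it pass", "refactor"],
    rationalizations={
        "I'll add tests later": "Late tests mirror the code, not the requirements.",
    },
    red_flags=[],  # deliberately empty to show the check
    verification=["all tests pass"],
)

print(testing.missing_parts())  # ['red_flags']
```

A check like this is only useful if you fork the skills and want to keep your own additions structurally complete.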

The engineering principles behind the skills, unpacked

The README references several engineering concepts from Google’s internal culture, drawn from the book Software Engineering at Google and Google’s public engineering practices guide. Some of these terms get thrown around without explanation, so let’s unpack them.

Hyrum’s Law

Named after Hyrum Wright, a Google engineer. The law states:

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

In plain language: if your API has any behavior at all, even undocumented or accidental behavior, someone will eventually rely on it. A function that happens to return results in alphabetical order (even though the spec doesn’t promise this) will eventually have a caller that breaks when the order changes.

This matters for AI agents because they generate APIs constantly, both public-facing and internal. Without awareness of Hyrum’s Law, an agent will create interfaces with implicit contracts it doesn’t know about. The api-and-interface-design skill bakes in this awareness by forcing the agent to think about contract-first design: define what the API promises, not just what it happens to do.

The practical takeaway: every public interface should have explicit contracts. If you don’t want users depending on a behavior, don’t expose it. If you can’t hide it, document that it’s not guaranteed.
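Here is a small Python sketch of the alphabetical-order scenario. Every function name is hypothetical and nothing here comes from the Agent Skills repo; it only shows how a caller can break while the documented contract stays intact.

```python
def list_usernames_v1(users: dict[str, int]) -> list[str]:
    """Contract: returns every username. Order is NOT promised."""
    # Implementation detail: sorting the keys happens to give alphabetical order.
    return sorted(users)


def list_usernames_v2(users: dict[str, int]) -> list[str]:
    """Same contract, new implementation: most recently active first."""
    return sorted(users, key=lambda name: users[name], reverse=True)


def first_alphabetical(usernames: list[str]) -> str:
    """A caller that silently depends on the undocumented ordering."""
    return usernames[0]


users = {"carol": 30, "alice": 10, "bob": 20}  # username -> last-active timestamp

# Both versions honor the documented contract (same set of names)...
assert set(list_usernames_v1(users)) == set(list_usernames_v2(users))

# ...but the caller's result changes, because it relied on observed behavior.
print(first_alphabetical(list_usernames_v1(users)))  # alice
print(first_alphabetical(list_usernames_v2(users)))  # carol
```

If the ordering mattered, the fix would be to promise it in the contract and test it; if it didn't, the docstring already says so, and the caller should sort for itself.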

Chesterton’s Fence

G.K. Chesterton proposed this thought experiment in 1929: if you come across a fence built across a road and can’t see a reason for it, don’t tear it down. First, figure out why someone built it. Only once you understand the original reason can you decide whether it’s still needed.

In software, this applies every time you encounter code that looks unnecessary or overly complex. The impulse to simplify is healthy, but simplifying without understanding is dangerous. That weird edge-case handler in a function might look like dead code until you learn it catches a race condition that only happens under load on Tuesdays when the cache is cold.

AI agents are particularly prone to violating Chesterton’s Fence. They see complex code and immediately want to simplify it, which is exactly what you’d want in many cases, except when the complexity exists for a reason the agent doesn’t have context for. The code-simplification skill addresses this by requiring the agent to understand why code exists before modifying it.

This is one of those principles that sounds obvious when stated but gets ignored constantly in practice, by humans and AI alike.

The Beyonce Rule

This one’s informal, but it shows up in Google’s testing culture. The rule is: “If you liked it, then you shoulda put a test on it.” (Yes, named after the song.)

The idea is simple: if you care about a behavior, write a test that verifies it. If the behavior isn’t tested, you can’t complain when someone else’s change breaks it. In a large codebase with hundreds of contributors, untested behavior is unowned behavior.

For AI agents, the Beyonce Rule is a forcing function against the most common agent failure mode: generating code that works in isolation but breaks other things. The test-driven-development skill uses this principle to ensure agents don’t just test their own new code but also verify that existing behaviors they depend on are protected by tests.
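In practice, "putting a test on it" can be as small as the sketch below. The slugify helper and its tests are hypothetical, written in the plain-assert style that pytest collects; the point is that each behavior a caller relies on gets pinned by name.

```python
import re


def slugify(title: str) -> str:
    """Hypothetical helper other code depends on: lowercase, hyphen-separated."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


# Each test pins one behavior callers rely on, so a later change
# (by a human or an agent) cannot break it silently.

def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"


def test_strips_surrounding_punctuation_and_spaces():
    assert slugify("  Ship it!  ") == "ship-it"


def test_collapses_runs_of_separators():
    assert slugify("spec -> plan") == "spec-plan"
```

An agent that later "simplifies" slugify now has to keep these three behaviors or explain why a test changed.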

Shift Left

“Shift Left” means moving quality checks earlier in the development process. Instead of finding bugs in production, find them in code review. Instead of finding them in code review, find them in your tests. Instead of finding them in tests, catch them in your spec.

The name comes from imagining the development lifecycle as a left-to-right timeline. The further left (earlier) you catch a problem, the cheaper it is to fix. A bug caught during implementation costs minutes. The same bug caught in production might cost hours of debugging, a rollback, and an incident report.

Agent Skills applies Shift Left throughout the entire lifecycle. The /spec command catches requirement gaps before any code is written. The /test command catches logic errors before code review. The /review command catches design issues before deployment. Each phase is a quality gate that prevents problems from flowing downstream.

This principle works especially well with AI agents because agents don’t get tired or impatient. They’ll happily run a 15-step verification process every single time, whereas a human engineer might start skipping steps by the third feature of the day.
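The gate idea can be sketched in a few lines of Python. The three checks below are hypothetical stand-ins that I made up for illustration, not what the /spec, /test, or /review commands actually run; what matters is the ordering and the early exit.

```python
from typing import Callable, Optional

# Each gate is (phase name, check). The checks are illustrative assumptions.
Gate = tuple[str, Callable[[dict], bool]]

GATES: list[Gate] = [
    ("spec", lambda change: bool(change.get("acceptance_criteria"))),
    ("test", lambda change: change.get("failing_tests", 0) == 0),
    ("review", lambda change: change.get("approved", False)),
]


def first_failing_gate(change: dict, gates: list[Gate] = GATES) -> Optional[str]:
    """Run gates left to right and stop at the first failure."""
    for name, check in gates:
        if not check(change):
            return name  # later, more expensive gates never run
    return None


no_spec = {"acceptance_criteria": [], "failing_tests": 3, "approved": False}
ready = {
    "acceptance_criteria": ["returns 404 for unknown id"],
    "failing_tests": 0,
    "approved": True,
}

print(first_failing_gate(no_spec))  # spec
print(first_failing_gate(ready))    # None
```

The change with no acceptance criteria is stopped at the cheapest gate, even though it would also have failed the two later ones.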

The test pyramid (80/15/5)

The test pyramid is a model for how to distribute your testing effort. At the base: lots of fast, isolated unit tests (80%). In the middle: some integration tests that verify how components work together (15%). At the top: a few end-to-end tests that exercise the whole system (5%).

The ratios are guidelines, not hard rules. The point is that your testing portfolio should be bottom-heavy. Unit tests are cheap to write, fast to run, and precise when they fail (they point directly at the broken thing). End-to-end tests are expensive, slow, and when they fail, you spend half your day figuring out which part of the chain broke.

AI agents, left to their own devices, tend to under-test or test at the wrong level. They’ll write a single integration test where five unit tests would give better coverage with faster feedback. The test pyramid gives agents a concrete allocation target rather than the vague instruction to “write comprehensive tests.”
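As a worked example of the 80/15/5 split, here is a short Python sketch. The function names are mine, and the rounding rule (any remainder goes to end-to-end tests) is an assumption; the ratios are guidelines, so treat the output as a target rather than a quota.

```python
def pyramid_allocation(total_tests: int) -> dict[str, int]:
    """Split a test budget by the 80/15/5 guideline, using integer math."""
    unit = total_tests * 80 // 100
    integration = total_tests * 15 // 100
    e2e = total_tests - unit - integration  # remainder goes to the top of the pyramid
    return {"unit": unit, "integration": integration, "e2e": e2e}


def is_bottom_heavy(counts: dict[str, int]) -> bool:
    """True when unit tests outnumber integration tests, which outnumber e2e tests."""
    return counts["unit"] > counts["integration"] > counts["e2e"]


print(pyramid_allocation(40))  # {'unit': 32, 'integration': 6, 'e2e': 2}
print(is_bottom_heavy({"unit": 1, "integration": 5, "e2e": 0}))  # False
```

The second call is the failure mode described above: one unit test and five integration tests is an upside-down pyramid.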

Trunk-based development

Instead of long-lived feature branches that diverge from main and require painful merges, trunk-based development keeps everyone committing to a single main branch (the “trunk”) with short-lived branches that get merged quickly, usually within a day.

This practice reduces integration pain. The longer a branch lives, the more it diverges from what everyone else is doing, and the harder the eventual merge becomes. Trunk-based development forces small, incremental changes that are easier to review and easier to roll back if something breaks.

For AI agents, trunk-based development pairs well with the “incremental implementation” skill. Instead of generating one massive changeset, the agent builds feature by feature, testing and committing each piece independently. This makes its work reviewable by humans in manageable chunks rather than as a single 500-line pull request where the reviewer gives up and clicks “Approve.”

Change sizing (~100 lines)

Google’s engineering practices recommend keeping code changes to roughly 100 lines of meaningful code. Larger changes are harder to review carefully, take longer to get feedback on, and are more likely to introduce subtle bugs that slip through review.

This is a constraint that agents violate constantly when left unchecked. An agent will happily generate a 400-line change if you let it. The code-review-and-quality skill enforces change sizing with a concrete target and provides splitting strategies for when a change is naturally larger: for example, separating refactors from feature additions, or infrastructure changes from business logic.
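If you want a mechanical version of that target, a rough Python sketch might look like this. It is not how the skill enforces sizing (the skill is a Markdown workflow); the function names and the counting rule are my assumptions, and it counts every added or removed line rather than only "meaningful" ones.

```python
MAX_CHANGED_LINES = 100  # the rough target cited above; tune it for your team


def changed_lines(unified_diff: str) -> int:
    """Count added and removed lines in a unified diff, skipping file headers.

    Simplification: every +/- line counts, including blank lines and comments,
    and a removed line whose own text starts with "--" would be mistaken for a header.
    """
    count = 0
    for line in unified_diff.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines, not content
        if line.startswith(("+", "-")):
            count += 1
    return count


def needs_split(unified_diff: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """True when the change is larger than the size target."""
    return changed_lines(unified_diff) > limit


sample = """\
--- a/app.py
+++ b/app.py
@@ -1,3 +1,4 @@
 import os
-print("hi")
+print("hello")
+print("world")
"""

print(changed_lines(sample))  # 3
print(needs_split(sample))    # False
```

You could feed it the text of a diff from your version control tool and fail a pre-commit check when needs_split returns True.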

Anti-rationalization: the quiet workhorse

If I had to pick one design choice that makes Agent Skills more useful than a typical prompt library, it’s the anti-rationalization tables. Every skill includes a set of common excuses an agent might use to skip steps, along with documented counter-arguments.

Why does this matter? Because LLMs are trained on text that includes plenty of reasoning about why it’s okay to cut corners. “For this small change, we don’t need a full spec.” “The function is simple enough that tests would be redundant.” “Performance optimization can wait until we have real traffic.” These are all sentences that appear in real commit messages and Slack threads. The model has internalized them.

Anti-rationalization tables pre-empt this by giving the model equally strong reasoning in the other direction. It’s not a guarantee, but it tilts the balance. And in practice, agents following these skills do write specs before code, do write tests before merge, and do measure before they optimize, more often than agents without them.

The honest caveat: this only works when the skills are loaded into the agent’s context. If the context window fills up and earlier instructions get pushed out, the anti-rationalization tables go with them. Token management matters.

How skills load: progressive disclosure

Speaking of tokens, Agent Skills is designed around progressive disclosure. Each skill has a single SKILL.md file as its entry point. Supporting materials and reference checklists load only when the skill needs them.

This matters in practice, not just in theory. Context windows are finite. Loading all 20 skills and their references at once would eat a substantial chunk of the available context before the agent has even seen your code. Progressive disclosure means the agent starts with the workflow it needs right now and pulls in references only when it hits a step that requires them.

For teams working on large codebases, this matters a lot. You need room in the context for your actual source code, not just for instructions about how to write source code.
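The loading pattern itself is easy to sketch in Python. This is an illustration of progressive disclosure in general, not the mechanism any particular agent uses: the LazySkill class, the in-memory file dictionary, and the references/launch-checklist.md path are all hypothetical, and character count is only a crude proxy for tokens.

```python
class LazySkill:
    """Loads the entry file up front and reference files only on request."""

    def __init__(self, files: dict[str, str], entry: str = "SKILL.md") -> None:
        self._files = files                  # stands in for files on disk
        self.loaded = {entry: files[entry]}  # what is currently in context

    def load_reference(self, name: str) -> str:
        """Pull a reference into context the first time a step needs it."""
        if name not in self.loaded:
            self.loaded[name] = self._files[name]
        return self.loaded[name]

    def context_chars(self) -> int:
        """Characters in context, a crude proxy for token usage."""
        return sum(len(text) for text in self.loaded.values())


files = {
    "SKILL.md": "Step 1: write the spec.\nStep 2: run the launch checklist.",
    "references/launch-checklist.md": "- rollback plan\n- monitoring\n" * 50,
}

skill = LazySkill(files)
before = skill.context_chars()
skill.load_reference("references/launch-checklist.md")
after = skill.context_chars()

print(before < after)  # True
```

Until the workflow reaches the step that needs the checklist, the only cost is the short entry file.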

The 20 skills at a glance

The skills break into six phases. Two skills handle the “Define” stage, turning vague ideas into concrete specs through structured thinking and full PRDs. One planning skill decomposes those specs into small, verifiable tasks with acceptance criteria.

The bulk of the collection sits in the “Build” phase: six skills covering implementation, testing, context management, source verification, frontend engineering, and API design. The newest addition here is source-driven-development, which forces the agent to ground every framework decision in official documentation and cite its sources. I like this one because it directly attacks the hallucination problem. An agent that has to link to the React docs before using a hook is less likely to invent an API that doesn’t exist.

Two skills handle verification (browser testing with DevTools, systematic debugging). Four cover review: code quality, simplification, security against the OWASP Top 10, and performance with Core Web Vitals profiling. Five more handle shipping: git workflow, CI/CD, deprecation, documentation with Architecture Decision Records, and launch checklists.

Three agent personas

Beyond the skills themselves, the project includes three pre-configured agent personas for targeted reviews. There’s a code reviewer that applies the “would a staff engineer approve this?” standard, a test engineer focused on coverage analysis and the Prove-It pattern (don’t claim something works, demonstrate it), and a security auditor that runs OWASP assessments and threat modeling.

These are useful when you want a second-pass review from a specific angle. You’ve built the feature, now throw it at the security auditor and see what comes back. I’ve found the code reviewer persona particularly good at catching things I’d let slide after staring at the same code for an hour.

Balanced take: what works and what doesn’t

I’ve been using these skills for a few days now. Here’s what I think lands well and where the limitations sit.

What works:

The lifecycle structure makes sense. Most prompt engineering for AI agents is ad hoc: a paragraph here, a rule there, whatever you remember to include at 11pm. Agent Skills gives you a framework that matches how software actually gets built. That alone saves you from the “what did I forget to tell the agent?” problem.

The anti-rationalization tables pull their weight. In my experience, agents following these skills skip quality steps less often. They still do it sometimes, but the difference is noticeable.

The format is portable. Plain Markdown means it works everywhere. Copy a file into your project and you’re done.

The progressive disclosure design respects context window limits. Most prompt libraries act like you have infinite context. You don’t, and Agent Skills knows it.

What doesn’t:

Twenty skills is a lot to manage. You’ll probably use five or six regularly and forget the rest exist. The slash commands help, but there’s still a learning curve.

The skills are opinionated toward Google’s engineering culture. If your team follows different practices (feature branches instead of trunk-based development, different testing ratios, different review norms), you’ll need to fork and adapt. The MIT license makes this easy, but it’s still work.

Agent Skills can’t fix fundamental model limitations. If the underlying model hallucinates or loses context mid-session, no workflow document will save you. The skills improve behavior within the bounds of what the model can do. They don’t expand those bounds. This is worth being clear-eyed about.

The token overhead is real. Loading a skill plus its references uses context that could hold source code. For smaller context windows, this trade-off starts to bite.

Getting started

If you want to try it:

For Claude Code:

/plugin marketplace add addyosmani/agent-skills
/plugin install agent-skills@addy-agent-skills

For Cursor: copy any SKILL.md into .cursor/rules/, or reference the full skills/ directory.

For Gemini CLI:

gemini skills install https://github.com/addyosmani/agent-skills.git --path skills

For other agents, the skills are plain Markdown. Drop them into whatever system prompt or instruction file your agent reads.

What this points to

Agent Skills is, at the end of the day, a collection of Markdown files. What makes it interesting is the underlying bet: that AI agents need structured engineering discipline the same way junior engineers do. “Write good code” is a wish. “Here’s the specific process, here’s what you’ll be tempted to skip, and here’s why you shouldn’t” is a workflow.

Whether this particular project becomes the standard, I have no idea. But the pattern will spread. As AI coding agents get more capable, the bottleneck stops being what they can do and becomes what we let them do without guardrails.

The repo is open source under MIT. If you’re using AI coding agents and the output keeps disappointing you, spend an afternoon with it. You’ll at least learn something about what your agent was skipping.

GitHub: addyosmani/agent-skills

“Skill up.”-Rushi