Design Philosophy
Validate against standards, not opinions. Every check traces back to a published specification, a research paper, or documented practitioner experience. When we add a new check, we document what informed it and why it matters.
Progressive depth. Checks are organized in layers. The first layer validates structure: are the right fields and sections present? The next layer checks semantics: do instructions contradict each other? Deeper layers examine content quality, security, agent readiness, and whether the knowledge inside a skill is genuinely substantive or just well-formatted filler.
Reward quality, not just punish problems. SkillCheck reports both weaknesses and strengths. A well-written gotchas section, concrete code references, and clear error handling don't just avoid penalties; they show up as positive signals in your report. Good work should be visible.
Free for structure, Pro for substance. Free checks tell you if the skill is built correctly. Pro checks tell you if it's built well. Every finding carries a severity: critical (fix it), warning (you should fix it), suggestion (consider it), or strength (you did something right).
How the Checks Evolved
SkillCheck started by implementing one lab's guidelines. Each phase made the checks more grounded and harder to game.
Phase 1
Build to Spec
v1.0 – v1.4 (Dec 2025)
Foundation checks built directly from Anthropic's official guidance on skill design. Five Anthropic blog posts defined what a well-formed skill looks like: frontmatter fields, trigger conditions, scope boundaries, testing scenarios. WCAG 2.2 standards informed accessibility checks. MCP Best Practices from modelcontextprotocol.io established protocol compliance.
- Claude Code Best Practices (Anthropic)
- MCP Best Practices (modelcontextprotocol.io)
- WCAG 2.2 (W3C)
- Anthropic: Skill Trust & Security Audit
Categories established: Structure (1), Body (2), Naming (3), Semantics (4), Anti-Slop (5), Visual (6), Security (7), Quality (8), Token (9), Enterprise (10)
Phase 2
Build from Practice
v3.10 – v3.14 (Mar 2026)
Practitioner insights from people building and shipping skills at scale. Thariq Shihipar's field observations from the Claude Code team introduced description-as-trigger-condition (4.8), railroading detection (4.9), and gotchas as a quality signal (8.9). The Agent Readiness framework (v3.10) added 6 new pillars and 28 checks for multi-agent orchestration, based on patterns observed in production skill portfolios.
Categories added: Quality Pro (11), Workflow (12), Reference Integrity (13), Eval Readiness (14), Orchestration Safety (15), Autonomy Design (16), Composability (17), Observability (18)
Phase 3
Build from the Field
v3.16 – v3.17 (Apr 2026)
Cross-lab methodology. OpenAI's systematic approach to testing agent skills informed the Eval Kit (v3.17), which auto-generates should-trigger and should-NOT-trigger test prompts. Google's ADK agent type hierarchy (LLM agents, workflow agents, custom agents) inspired SkillCheck's own design pattern classification (Reviewer, Generator, Inversion, Pipeline, Tool Wrapper) for pattern-specific validation (Category 19). CI/CD integration (v3.16) brought SkillCheck into GitHub Actions workflows.
Categories added: Design Pattern (19), Trigger Collision (20), Eval Kit (21)
Phase 4
Build from the Ecosystem
v3.18.0
Academic research and community-built tools. Hasan et al.'s 6-component scoring framework analyzed 856 MCP tools across 103 servers and found 97% contained at least one description smell. Wang et al.'s 18-category smell taxonomy spans four quality dimensions (accuracy, functionality, completeness, conciseness) across 10,831 servers, showing standard-compliant descriptions reach 72% selection probability against a 20% baseline. Glama's TDQS industry benchmark validated similar quality dimensions at scale. Adversarial analysis of knowledge capture tools in the Chinese developer ecosystem revealed a taxonomy of what makes documented knowledge genuinely valuable vs. hollow.
- MCP Tool Descriptions Are Smelly! (Hasan et al., arXiv:2602.14878)
- From Docs to Descriptions: Smell-Aware Evaluation (Wang et al., arXiv:2602.18914)
- Tool Description Quality Score (TDQS) (Glama)
- Adversarial skill file analysis (colleague-skill, anti-distill)
Categories added: Knowledge Density (22)
Phase 5
Build for the Marketplace
v3.20.0 (Apr 2026)
Skill quality is one mechanism inside a bigger problem: how do orgs run a marketplace of skills, plugins, and MCP servers without it turning into tribal knowledge chaos? Phase 5 widens the rubric from SKILL.md only to three artifact types. Cat 23 (Agent Integration Readiness) scores MCP servers along Sam Morrow's four axes (token efficiency, security, unique unlocks, execution environment), populated with Anthropic's six production patterns. Cat 24 (Skill Marketplace Governance) scores plugin marketplaces against the Anthropic reference schema with governance recommendations (maintainers, change-gate evals, deprecation paths) the reference itself doesn't yet ship. Cat 25 (Memory Governance) is design-locked against Anthropic's Managed Agents memory primitives but waits for first-party memory-using exemplar skills before its rubric leaves provisional state.
The defensible claim from Phase 5: SkillCheck measures whether your server respects the user-vs-agent distinction the MCP spec explicitly names, and whether your marketplace declares the ownership and eval governance the reference Anthropic marketplace publicly lacks. The 21-plugin benchmark of anthropics/knowledge-work-plugins (April 2026) empirically confirms both: zero criticals across the entire reference (the rubric is well-calibrated), and 0 of 21 plugins declare maintainers or evals (the rubric still has signal).
Categories added: Agent Integration Readiness (23), Skill Marketplace Governance (24), Memory Governance (25, design)
How a Check Works
Every check follows the same pattern: pre-compiled regular expressions scan the skill content line by line, skipping code blocks and frontmatter. Compound patterns require multiple signals in the same line to fire. Results carry a severity level that feeds the scoring engine.
// Example: consequence pattern detection (Pro)
Input: "Never call HTTP inside transactions; we had a 3-hour outage"
Match: imperative ("Never") + consequence ("we had a 3-hour outage")
Result: strength, "Knowledge density: consequence pattern found"
// Contrast: hollow content (no match)
Input: "Follow team standards for transaction handling"
Match: none (no imperative + consequence compound)
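The flow above can be sketched in Python. This is a minimal illustration, not SkillCheck's implementation. The two regexes, the `scan` function, and the finding fields are hypothetical stand-ins, since the actual rule set is not shown here. It also assumes `---` frontmatter delimiters and triple-backtick code fences.

```python
import re

# Hypothetical signal patterns, compiled once up front (assumptions for illustration).
IMPERATIVE = re.compile(r"\b(never|always|do not|don't|must)\b", re.IGNORECASE)
CONSEQUENCE = re.compile(r"\b(we had|caused|resulted in|led to)\b", re.IGNORECASE)

FENCE = "`" * 3  # marker that opens or closes a fenced code block


def scan(skill_text):
    """Scan line by line, skipping frontmatter and fenced code blocks."""
    findings = []
    lines = skill_text.splitlines()
    in_frontmatter = bool(lines) and lines[0].strip() == "---"
    in_code = False
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if in_frontmatter:
            # Any '---' after the first line closes the frontmatter.
            if number > 1 and stripped == "---":
                in_frontmatter = False
            continue
        if stripped.startswith(FENCE):
            in_code = not in_code
            continue
        if in_code:
            continue
        # Compound rule: both signals must appear on the same line.
        if IMPERATIVE.search(line) and CONSEQUENCE.search(line):
            findings.append({
                "line": number,
                "severity": "strength",
                "message": "Knowledge density: consequence pattern found",
            })
    return findings
```

With these assumed patterns, the first example input yields one strength finding and the second yields none. A matching sentence placed inside frontmatter or a code fence is ignored.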
How Checks Evaluate
SkillCheck's checks fall into three evaluation styles, depending on what they're measuring. The style is independent of whether the check is Free or Pro — both tiers include all three.
Structural
Things that are either present or absent: required frontmatter fields, file references that resolve, secrets that shouldn't be in source, token counts. These checks are exact — they either pass or fail, with no ambiguity. Categories: Structure (1), Body (2), Visual (6), Security (7), Token (9), Reference Integrity (13), Trigger Collision (20).
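As a rough illustration of a pass-or-fail structural check, the sketch below reports which required frontmatter fields are missing. The field names `name` and `description`, the function name, and the naive `key: value` parsing are all assumptions made for the example. A real check would presumably use a proper YAML parser.

```python
# Assumed required fields, for illustration only.
REQUIRED_FIELDS = ("name", "description")


def missing_frontmatter_fields(skill_text, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in the frontmatter."""
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        # No frontmatter block at all: every required field is missing.
        return list(required)
    present = set()
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        # Naive parse: count a field only if it has a non-empty value.
        if sep and value.strip():
            present.add(key.strip())
    return [field for field in required if field not in present]
```

An empty list means the check passes. Any other result names exactly what is missing, which keeps the outcome unambiguous.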
Pattern
Content matched against named patterns: anti-slop phrases, knowledge density signals, design patterns, observability hooks, governance checklists. The patterns are explicit and inspectable — you can read the rules and predict the outcome. Categories: Naming (3), Anti-Slop (5), Quality (8), Enterprise (10), Quality Pro (11), Eval Readiness (14), Composability (17), Observability (18), Design Pattern (19), Knowledge Density (22).
Judgment
Content that needs reading comprehension: do these instructions contradict each other, is this workflow actually clear, does this subagent prompt give enough specificity, does this skill respect autonomy boundaries. Today these checks use pattern-matching as a floor; the ceiling is rubric-based evaluation against a published criteria list. Categories: Semantics (4), Workflow (12), Orchestration Safety (15), Autonomy Design (16).
Why this matters: structural and pattern checks are reproducible — same input, same finding. Judgment checks have a wider tolerance band by design; what matters is that the criteria are documented and applied transparently. SkillCheck's judgment checks publish their criteria, not just their verdicts.
Scoring
Skills start at 100 points. Findings subtract based on severity. Strengths don't add points but appear in the quality report as positive signals.
| Severity | Impact | Meaning |
| Critical | −20 | Structural violation. Must fix. |
| Warning | −5 | Quality gap. Should fix. |
| Suggestion | −1 | Minor improvement. Nice to fix. |
| Strength | +0 | Positive signal. No penalty, appears in report. |
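The arithmetic in the table can be written as a short Python sketch. The names `PENALTIES` and `score` are hypothetical, and clamping the result at zero is an assumption. The text above does not say what happens when penalties exceed 100.

```python
# Points subtracted per finding, taken from the severity table.
PENALTIES = {"critical": 20, "warning": 5, "suggestion": 1, "strength": 0}


def score(severities):
    """Start at 100 and subtract the penalty for each finding's severity."""
    total = 100 - sum(PENALTIES[severity] for severity in severities)
    return max(total, 0)  # assumption: the score never drops below zero
```

For example, one critical, two warnings, one suggestion, and one strength give 100 − 20 − 5 − 5 − 1 − 0 = 69.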
Free vs Pro
Free checks validate shape: does the skill have the right structure, fields, and sections? Pro checks validate substance: is the content inside those sections actually good? Free tells you what's missing. Pro tells you whether what's there is real.
Independence
SkillCheck is an independent project. It is not affiliated with, endorsed by, or officially connected to Anthropic, OpenAI, Google, or any other AI lab. Research from these organizations informed specific check categories, as documented above. The implementation, scoring, and quality judgments are SkillCheck's own.