Design Philosophy
Validate against standards, not opinions. Every check traces back to a published specification, a research paper, or documented practitioner experience. When we add a new check, we document what informed it and why it matters.
Progressive depth. Checks are organized in layers. The first layer validates structure: are the right fields and sections present? The next layer checks semantics: do instructions contradict each other? Deeper layers examine content quality, security, agent readiness, and whether the knowledge inside a skill is genuinely substantive or just well-formatted filler.
Reward quality, not just punish problems. SkillCheck reports both weaknesses and strengths. Well-written gotchas sections, concrete code references, and clear error handling don't just avoid penalties; they show up as positive signals in your report. Good work should be visible.
Free for structure, Pro for substance. Free checks tell you if the skill is built correctly. Pro checks tell you if it's built well. Every finding carries a severity: critical (fix it), warning (you should fix it), suggestion (consider it), or strength (you did something right).
How the Checks Evolved
SkillCheck started by implementing one lab's guidelines. Each phase made the checks more grounded and harder to game.
Phase 1
Build to Spec
v1.0 – v1.4 (Dec 2025)
Foundation checks built directly from Anthropic's official guidance on skill design. Five Anthropic blog posts defined what a well-formed skill looks like: frontmatter fields, trigger conditions, scope boundaries, testing scenarios. WCAG 2.2 standards informed accessibility checks. MCP Best Practices from modelcontextprotocol.io established protocol compliance.
- Claude Code Best Practices (Anthropic)
- MCP Best Practices (modelcontextprotocol.io)
- WCAG 2.2 (W3C)
- Anthropic: Skill Trust & Security Audit
Categories established: Structure (1), Body (2), Naming (3), Semantics (4), Anti-Slop (5), Visual (6), Security (7), Quality (8), Token (9), Enterprise (10)
Phase 2
Build from Practice
v3.10 – v3.14 (Mar 2026)
Practitioner insights from people building and shipping skills at scale. Thariq Shihipar's field observations from the Claude Code team introduced description-as-trigger-condition (4.8), railroading detection (4.9), and gotchas as a quality signal (8.9). The Agent Readiness framework (v3.10) added 6 new pillars and 28 checks for multi-agent orchestration, based on patterns observed in production skill portfolios.
Categories added: Quality Pro (11), Workflow (12), Reference Integrity (13), Eval Readiness (14), Orchestration Safety (15), Autonomy Design (16), Composability (17), Observability (18)
Phase 3
Build from the Field
v3.16 – v3.17 (Apr 2026)
Cross-lab methodology. OpenAI's systematic approach to testing agent skills informed the Eval Kit (v3.17), which auto-generates should-trigger and should-NOT-trigger test prompts. Google's ADK agent type hierarchy (LLM agents, workflow agents, custom agents) inspired SkillCheck's own design pattern classification (Reviewer, Generator, Inversion, Pipeline, Tool Wrapper) for pattern-specific validation (Category 19). CI/CD integration (v3.16) brought SkillCheck into GitHub Actions workflows.
Categories added: Design Pattern (19), Trigger Collision (20), Eval Kit (21)
Phase 4
Build from the Ecosystem
v3.18.0
Academic research and community-built tools. Wang et al.'s 18-category smell taxonomy for MCP tool descriptions identified specific quality defects in tool definitions. Hasan et al.'s follow-up framework formalized smell-aware evaluation scoring. Glama's TDQS industry benchmark validated similar quality dimensions at scale. Adversarial analysis of knowledge capture tools in the Chinese developer ecosystem revealed a taxonomy of what makes documented knowledge genuinely valuable vs. hollow.
- MCP Tool Descriptions Are Smelly! (Wang et al., arXiv:2602.14878)
- From Docs to Descriptions: Smell-Aware Evaluation (Hasan et al., arXiv:2602.18914)
- Tool Description Quality Score (TDQS) (Glama)
- Adversarial skill file analysis (colleague-skill, anti-distill)
Categories added: Knowledge Density (22)
How a Check Works
Every check follows the same pattern: pre-compiled regex scans the skill content line by line, skipping code blocks and frontmatter. Compound patterns require multiple signals in the same line to fire. Results carry a severity level that feeds the scoring engine.
// Example: consequence pattern detection (Pro)
Input: "Never call HTTP inside transactions; we had a 3-hour outage"
Match: imperative ("Never") + consequence ("we had a 3-hour outage")
Result: strength — "Knowledge density: consequence pattern found"
// Contrast: hollow content (no match)
Input: "Follow team standards for transaction handling"
Match: none (no imperative + consequence compound)
Result: no finding; hollow content earns no strength signal
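The compound match above can be sketched in a few lines of Python. This is a minimal illustration, not SkillCheck's actual implementation: the pattern lists and the function name are hypothetical stand-ins, and real checks use far larger pattern sets and also skip frontmatter.

```python
import re

# Hypothetical signal lists; SkillCheck's real patterns are more extensive.
IMPERATIVE = re.compile(r"\b(never|always|must|do not|don't)\b", re.IGNORECASE)
CONSEQUENCE = re.compile(r"\b(outage|incident|data loss|we had|caused)\b", re.IGNORECASE)

def consequence_pattern_findings(skill_text: str) -> list[tuple[int, str]]:
    """Scan line by line, skipping fenced code blocks, and report lines
    where an imperative and a consequence co-occur (a compound match)."""
    findings = []
    in_code = False
    for lineno, line in enumerate(skill_text.splitlines(), start=1):
        if line.strip().startswith("```"):
            in_code = not in_code  # toggle on opening/closing fences
            continue
        if in_code:
            continue
        # Compound check: BOTH signals must fire on the same line.
        if IMPERATIVE.search(line) and CONSEQUENCE.search(line):
            findings.append((lineno, "strength: consequence pattern found"))
    return findings

print(consequence_pattern_findings(
    "Never call HTTP inside transactions; we had a 3-hour outage\n"
    "Follow team standards for transaction handling\n"
))  # → [(1, 'strength: consequence pattern found')]
```

Requiring both signals on one line is what makes the check hard to game: an imperative without a recorded consequence, or a war story without a rule, matches nothing.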
Scoring
Skills start at 100 points. Findings subtract points based on severity. Strengths don't add points but appear in the quality report as positive signals.
| Severity | Impact | Meaning |
| --- | --- | --- |
| Critical | −20 | Structural violation. Must fix. |
| Warning | −5 | Quality gap. Should fix. |
| Suggestion | −1 | Minor improvement. Nice to fix. |
| Strength | +0 | Positive signal. No penalty, appears in report. |
Free vs Pro
Free checks validate shape: does the skill have the right structure, fields, and sections? Pro checks validate substance: is the content inside those sections actually good? Free tells you what's missing. Pro tells you whether what's there is real.
Independence
SkillCheck is an independent project. It is not affiliated with, endorsed by, or officially connected to Anthropic, OpenAI, Google, or any other AI lab. Research from these organizations informed specific check categories, as documented above. The implementation, scoring, and quality judgments are SkillCheck's own.