Independent developer darkrishabh has released agent-skills-eval, an open-source test runner for evaluating AI agent skills that has gained 447 stars on GitHub since its May 6, 2026 release. The tool addresses a growing need for systematic testing as agent skills become standardized across AI coding assistants.
Full Implementation of AgentSkills.io Specification
Agent-skills-eval provides a complete implementation of the agentskills.io specification, which has become a standard for Claude Code, Claude Desktop, Cursor, and other AI coding assistants. The tool validates SKILL.md files, supports an evals/evals.json configuration, implements the official iteration-N artifact layout, and enforces the spec's frontmatter rules.
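For orientation, a minimal SKILL.md looks like the sketch below. The name and description frontmatter fields are the core ones the agentskills.io spec describes; the skill itself is a hypothetical example, not one that ships with the tool.

```markdown
---
name: commit-message-writer
description: Writes conventional commit messages from a staged git diff.
---

# Commit Message Writer

Step-by-step instructions the agent loads when this skill is invoked...
```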
The testing framework works out of the box with OpenAI, Together, Groq, and Anthropic APIs via OpenAI-compatible layers, as well as local Llama servers—essentially any service that speaks the OpenAI chat API. This broad compatibility makes it accessible for developers across different AI platforms.
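In practice, "speaks the OpenAI chat API" means any endpoint accepting the standard chat completions request shape can serve as an eval provider. The sketch below sends such a request in TypeScript; the base URL points at a hypothetical local Llama server, and the model name depends on what that server exposes.

```typescript
// Any OpenAI-compatible endpoint works: hosted APIs or a local server.
// The base URL and model name here are illustrative assumptions.
const BASE_URL = process.env.OPENAI_BASE_URL ?? "http://localhost:8080/v1";

const response = await fetch(`${BASE_URL}/chat/completions`, {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.OPENAI_API_KEY ?? "not-needed-locally"}`,
  },
  body: JSON.stringify({
    model: "llama-3.1-8b-instruct", // whatever the server advertises
    messages: [{ role: "user", content: "Say hello." }],
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);
```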
CLI and SDK for Multiple Use Cases
The project offers both a one-liner CLI for continuous integration pipelines and a full SDK for custom implementations. Developers can integrate agent-skills-eval into CI workflows for automated testing or build custom pipelines, providers, and dashboards using the SDK.
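The project's README is authoritative for the actual API; as a hedged sketch only, SDK usage in a custom pipeline might look like the following, where the import path, function name, and option fields are all assumptions for illustration rather than the published interface.

```typescript
// Hypothetical SDK usage: the import path, runEvals function, and option
// shape are assumptions, not the tool's confirmed API.
import { runEvals } from "agent-skills-eval";

const results = await runEvals({
  skillDir: "./skills/commit-message-writer", // folder containing SKILL.md
  provider: {
    baseUrl: process.env.OPENAI_BASE_URL, // any OpenAI-compatible endpoint
    model: "gpt-4o-mini",
  },
});

for (const result of results) {
  console.log(`${result.name}: ${result.passed ? "PASS" : "FAIL"}`);
}
```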
A standout feature is automatic judge assertion promotion: when developers provide an expected_output without explicit assertions, the SDK promotes the expected output to a judge assertion. This lets minimal eval files produce meaningful pass/fail grading without extensive configuration.
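Conceptually, the promotion is a fallback rule: explicit assertions are used when present, and otherwise expected_output becomes the rubric for an LLM judge. The sketch below captures that logic; the field names and assertion kinds are illustrative assumptions, not the tool's confirmed schema.

```typescript
// Conceptual sketch of judge assertion promotion. All type and field
// names here are assumptions for illustration.
interface Assertion {
  type: "judge" | "contains" | "regex"; // illustrative assertion kinds
  value: string;
}

interface EvalCase {
  name: string;
  input: string;
  expected_output?: string;
  assertions?: Assertion[];
}

function promoteAssertions(evalCase: EvalCase): Assertion[] {
  // Explicit assertions always take precedence.
  if (evalCase.assertions?.length) return evalCase.assertions;
  // Otherwise, fall back to an LLM-judge comparison against expected_output.
  if (evalCase.expected_output) {
    return [{ type: "judge", value: evalCase.expected_output }];
  }
  return [];
}
```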
Growing Infrastructure Around Agent Skills
The repository carries the GitHub topics agent-evals, agent-skills, agentskills, ai-agents, cli, jsonl, llm-evals, llm-evaluation, openai-compatible, typescript, and yaml. This tag set reflects the tool's position at the intersection of several emerging standards in AI agent development.
As AI agent skills become more standardized across platforms, testing infrastructure like agent-skills-eval becomes crucial for ensuring consistent behavior and reliability. The project's rapid star accumulation suggests strong demand for systematic agent skill evaluation in the developer community.
Key Takeaways
- Agent-skills-eval has gained 447 GitHub stars since its May 6, 2026 release by independent developer darkrishabh
- The tool provides a full implementation of the agentskills.io specification used by Claude Code, Claude Desktop, Cursor, and other AI coding assistants
- Works with OpenAI, Together, Groq, Anthropic, and local Llama servers through OpenAI-compatible APIs
- Offers both one-liner CLI for CI integration and full SDK for custom pipelines and dashboards
- Features automatic judge assertion promotion, which turns an expected_output into a judge assertion for meaningful pass/fail grading without extensive configuration