A Framework for Evaluating Agentic Skills at Scale
We present a framework for evaluating agentic skills by constructing realistic tasks and assessing skill utility through task execution. Applied to 500 real-world skills, it generates 1,000 tasks and scoring rubrics, evaluating 19 agent-model configurations across proprietary and open-source models. Results show significant variation in instruction adherence and performance gains, with skills substantially altering model behavior compared to no-skill setups.