Detecting Malicious Agent Skills in the Wild using Attention

The authors present Locate-and-Judge, a two-stage detector designed to identify malicious skills in LLM agent marketplaces where traditional prompt-injection defenses fail.

The system uses a lightweight locator to score structural spans by instruction-following attention and retains only the top-K spans for detailed judgment.
This approach reduces costs by an order of magnitude compared to direct LLM-based scanning, enabling marketplace-scale auditing with negligible expense.
Locate-and-Judge dominates keyword and regex baselines at comparable cost and successfully flagged dozens of live malicious skills, including those missed by SkillSpector and Cisco Skill Scanner.

The method allows for the efficient audit of entire skill marketplaces rather than just samples, surfacing hidden threats that evade existing detection tools.