Adversarial Pragmatics introduces a benchmark for instruction conflict and embedded commands

This paper introduces adversarial pragmatics, a new benchmark and annotation protocol for evaluating AI model behavior under complex linguistic conditions such as instruction conflict, embedded commands, and policy ambiguity. Existing safety evaluations often collapse these conditions into pass/fail labels, which obscures whether a failure stems from a capability limit, an unstable evaluator judgment, or some other root cause.

The framework provides a linguistically controlled taxonomy for analyzing ambiguous natural-language behavior in agentic tasks.
It includes an 18-item seed benchmark with validator-enforced metadata and a 54-row local seed pilot.
An expert-evaluation protocol distinguishes between task success, policy compliance, safety risk, refusal outcome, and evaluator confidence.
The methodology offers metrics for judge validity, diagnostic ambiguity, and taxonomy drift to validate safety evaluations and LLM judges.
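
To make the separation of evaluation dimensions concrete, the following is a minimal sketch of what one annotation record and a per-dimension judge-versus-expert comparison might look like. It is not the paper's schema or metric: the `Annotation` class, the field types, the 0 to 3 risk scale, the refusal labels, and the `agreement` function are all assumptions introduced for illustration, and raw agreement stands in for whatever judge-validity metric the paper actually defines.

```python
from dataclasses import dataclass

# Hypothetical label set; the paper's actual refusal vocabulary is not given here.
REFUSAL_OUTCOMES = {"complied", "refused", "partial"}


@dataclass(frozen=True)
class Annotation:
    """One evaluator's labels for one benchmark item (illustrative schema)."""
    item_id: str
    task_success: bool
    policy_compliance: bool
    safety_risk: int             # assumed ordinal scale: 0 (none) to 3 (severe)
    refusal_outcome: str         # assumed to be one of REFUSAL_OUTCOMES
    evaluator_confidence: float  # assumed range: 0.0 to 1.0

    def __post_init__(self):
        # Reject records whose metadata falls outside the assumed ranges.
        if not 0 <= self.safety_risk <= 3:
            raise ValueError(f"safety_risk out of range: {self.safety_risk}")
        if self.refusal_outcome not in REFUSAL_OUTCOMES:
            raise ValueError(f"unknown refusal_outcome: {self.refusal_outcome}")
        if not 0.0 <= self.evaluator_confidence <= 1.0:
            raise ValueError(f"confidence out of range: {self.evaluator_confidence}")


def agreement(expert, judge, field):
    """Fraction of shared items where the judge matches the expert on one dimension."""
    judge_by_id = {a.item_id: a for a in judge}
    pairs = [(e, judge_by_id[e.item_id]) for e in expert if e.item_id in judge_by_id]
    if not pairs:
        raise ValueError("no overlapping items to compare")
    matches = sum(getattr(e, field) == getattr(j, field) for e, j in pairs)
    return matches / len(pairs)


expert = [
    Annotation("a1", True, True, 0, "complied", 0.9),
    Annotation("a2", False, True, 2, "refused", 0.6),
]
judge = [
    Annotation("a1", True, True, 0, "complied", 0.8),
    Annotation("a2", True, True, 1, "refused", 0.7),
]

task_agreement = agreement(expert, judge, "task_success")
policy_agreement = agreement(expert, judge, "policy_compliance")
```

Keeping the dimensions as separate fields means a judge can agree with experts on policy compliance while disagreeing on task success, a distinction that a single pass/fail label would hide.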