arXiv cs.CL · 8d ago · src: 9d ago · research

STATEWITNESS: Activation Explainer for Deception Auditing in LLMs


STATEWITNESS introduces an activation explainer that audits deception in reasoning LLMs by reading hidden states and generating natural-language answers or structured reports. It achieves a 0.916 mean AUROC, outperforming existing black-box monitors and activation probes by 11.6% and 25.0%, respectively, and provides query-level, schema-level, and evidence-level traces for human inspection.

Importance 3/3 · Beats a top-lab benchmark · New feature vs. leaders · New harness with differentiators · arXiv cs.CL · OpenAI · Google DeepMind · Meta AI · Evaluation & benchmarks · Reasoning models · Safety & alignment
