Researchers introduce Distill to Detect (D2D), a method that exposes hidden preferential biases in large language models by converting distributional shifts into detectable text. The technique uses a KV-cache prefix adapter, called a cartridge, to amplify the divergence between a suspected model and its base version.
- D2D distills the shift between a model and its base into a cartridge that concentrates dominant divergences.
- The method amplifies hidden biases so they are reliably detectable across multiple bias types.
- A theoretical framework explains why D2D works, describing it as a Fisher-weighted projection of the shift between the two models' logit distributions.
By turning prefix-tuning adapters into detection tools, D2D provides a practical building block for auditing hidden behaviors in deployed language models.
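
To make the two ideas in the summary concrete, the toy sketch below works on a single next-token distribution. It is not the D2D implementation. It swaps in plain logit extrapolation where D2D would use a trained cartridge, and it uses the standard second-order result that the KL divergence between two softmax distributions is approximately a Fisher-weighted quadratic form in the logit shift. The four-token vocabulary, the logit values, the `alpha` factor, and all function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def fisher_quadratic(base_logits, delta):
    """Second-order KL approximation: 0.5 * delta^T F delta.

    F = diag(p) - p p^T is the Fisher information of a softmax with
    respect to its logits, so the quadratic form equals half the
    variance of delta under p = softmax(base_logits).
    """
    p = softmax(base_logits)
    mean = sum(pi * di for pi, di in zip(p, delta))
    var = sum(pi * (di - mean) ** 2 for pi, di in zip(p, delta))
    return 0.5 * var


def amplify(base_logits, suspect_logits, alpha):
    """Extrapolate along the base-to-suspect logit shift (hypothetical
    stand-in for the amplification a cartridge would provide)."""
    return [b + alpha * (s - b) for b, s in zip(base_logits, suspect_logits)]


# Toy data (assumed): the suspect model slightly prefers token 2.
base = [2.0, 1.0, 0.5, 0.0]
suspect = [2.0, 1.0, 0.8, 0.0]
delta = [s - b for b, s in zip(base, suspect)]

p_base = softmax(base)
p_suspect = softmax(suspect)

exact = kl(p_base, p_suspect)            # true divergence, small
approx = fisher_quadratic(base, delta)   # Fisher-weighted estimate

amplified = softmax(amplify(base, suspect, alpha=8.0))

print(f"exact KL      = {exact:.5f}")
print(f"Fisher approx = {approx:.5f}")
print("most likely token (suspect):  ", p_suspect.index(max(p_suspect)))
print("most likely token (amplified):", amplified.index(max(amplified)))
```

In this toy setting the suspect model's most likely token is the same as the base model's, so the preference is invisible in greedy output. After amplification the preferred token becomes the most likely one, which is the sense in which a distributional shift turns into detectable text.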