Distill to Detect surfaces stealth LLM biases via cartridge distillation

Researchers introduce Distill to Detect (D2D), a method that exposes hidden preferential biases in large language models by converting distributional shifts into detectable text. The technique uses a KV-cache prefix adapter, called a cartridge, to amplify the divergence between a suspected model and its base version.
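For readers unfamiliar with prefix-style adapters, the toy sketch below shows the general mechanism a KV-cache prefix relies on: extra key/value pairs placed ahead of the cached context change what attention returns, while the model weights stay untouched. It is an illustration in plain Python under simplifying assumptions (one head, one query, hand-picked numbers), not the authors' implementation; the function names and the prefix values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def attend_with_cartridge(query, keys, values, prefix_keys, prefix_values):
    """Same attention, with prefix key/value pairs placed before the cache.

    In a real adapter the prefix entries would be trained; here they are
    fixed, hypothetical numbers chosen only to make the effect visible.
    """
    return attend(query, prefix_keys + keys, prefix_values + values)


# Toy cache: two context tokens with 2-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 1.0]

# Hypothetical one-slot cartridge that the query attends to strongly.
prefix_keys = [[2.0, 2.0]]
prefix_values = [[5.0, 5.0]]

plain = attend(query, keys, values)
steered = attend_with_cartridge(query, keys, values, prefix_keys, prefix_values)

print("without prefix:", plain)
print("with prefix:   ", steered)
```

The only difference between the two calls is the prepended prefix, which is why such an adapter can be trained and swapped without modifying the underlying model.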

D2D distills the shift between a model and its base into a cartridge that concentrates dominant divergences.
The method amplifies hidden biases so they are reliably detectable across multiple bias types.
A theoretical framework explains the efficacy of D2D through a Fisher-weighted projection of logit distribution shifts.
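As background for the Fisher-weighted view, the sketch below checks a standard identity for softmax distributions: for a small logit shift, the KL divergence between the base and shifted next-token distributions is approximately half the shift's quadratic form under the Fisher information matrix diag(p) - p pᵀ, which equals half the variance of the shift under the base distribution. This is a general fact about softmax, offered as an illustration rather than the paper's derivation; the logits and the shift are made-up numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def fisher_quadratic(p, delta):
    """delta^T (diag(p) - p p^T) delta, i.e. the variance of delta under p."""
    mean = sum(pi * di for pi, di in zip(p, delta))
    return sum(pi * (di - mean) ** 2 for pi, di in zip(p, delta))


# Hypothetical base logits over a 4-token vocabulary and a small shift
# standing in for the difference between a tuned model and its base.
base_logits = [2.0, 1.0, 0.5, -1.0]
delta = [0.0, 0.05, -0.02, 0.10]
tuned_logits = [z + d for z, d in zip(base_logits, delta)]

p = softmax(base_logits)
q = softmax(tuned_logits)

exact = kl_divergence(p, q)
approx = 0.5 * fisher_quadratic(p, delta)

print("exact KL:            ", exact)
print("Fisher approximation:", approx)
```

Because the Fisher matrix weights each token's shift by its base probability, a shift spread thinly across many tokens yields a tiny divergence, which is consistent with the brief's point that such biases need amplification before they become visible.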

By turning prefix-tuning adapters into detection tools, D2D provides a practical building block for auditing hidden behaviors in deployed language models.