The authors propose a training-free method to mitigate typographic attacks on CLIP-based vision encoders, in which irrelevant text appearing in an image biases the visual representation toward the text's lexical meaning. Using sampling-based interpretations and circuit mining, the approach isolates the specific Vision Transformer components that encode this unwanted lexical information.
- The method quantitatively attributes semantic versus lexical focus to individual attention heads through probabilistic analysis.
- Simple interventions on identified circuits improve robustness in object classification without additional training.
- These interventions outperform both supervised and other training-free defense methods.
- Applying the approach to state-of-the-art large vision-language models (LVLMs) yields substantial gains in Visual Question Answering accuracy on RIO-Bench under typographic attacks.
This mechanistic approach offers an interpretable and generalizable way to harden safety-critical applications, such as autonomous driving, against text-induced biases.
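To make the idea of head-level attribution and intervention concrete, the following is a minimal sketch on synthetic data, not the authors' procedure. It assumes the image representation can be written as a sum of per-head contributions, scores each head by how much more it aligns with a "lexical" direction than with a "semantic" one, and zero-ablates the heads that score as lexical. The function names (`head_lexical_scores`, `ablate_heads`, `semantic_accuracy`), the scoring rule, the threshold of zero, and the choice of zero-ablation are all illustrative assumptions; the summary above does not specify them.

```python
import numpy as np


def unit(v):
    """Normalize each row of v to unit length."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def head_lexical_scores(head_contribs, lexical_dirs, semantic_dirs):
    """Hypothetical score: mean (lexical projection - semantic projection) per head.

    head_contribs: (n_samples, n_heads, d) per-head contributions to the representation.
    lexical_dirs, semantic_dirs: (n_samples, d) unit vectors per sample.
    """
    lex = np.einsum("nhd,nd->nh", head_contribs, lexical_dirs)
    sem = np.einsum("nhd,nd->nh", head_contribs, semantic_dirs)
    return (lex - sem).mean(axis=0)


def ablate_heads(head_contribs, heads):
    """Assumed intervention: zero out the contributions of the selected heads."""
    out = head_contribs.copy()
    out[:, heads, :] = 0.0
    return out


def semantic_accuracy(head_contribs, lexical_dirs, semantic_dirs):
    """Fraction of samples whose summed representation is closer to the
    semantic direction than to the lexical (attack) direction."""
    rep = head_contribs.sum(axis=1)
    sem = np.einsum("nd,nd->n", rep, semantic_dirs)
    lex = np.einsum("nd,nd->n", rep, lexical_dirs)
    return float((sem > lex).mean())


# Synthetic setup (all values are made up for illustration).
rng = np.random.default_rng(0)
n_samples, n_heads, d = 64, 4, 16
semantic_dirs = unit(rng.normal(size=(n_samples, d)))
lexical_dirs = unit(rng.normal(size=(n_samples, d)))

head_contribs = 0.01 * rng.normal(size=(n_samples, n_heads, d))
head_contribs[:, :3, :] += semantic_dirs[:, None, :]  # heads 0-2 carry semantics
head_contribs[:, 3, :] += 5.0 * lexical_dirs          # head 3 carries the attack text

scores = head_lexical_scores(head_contribs, lexical_dirs, semantic_dirs)
lexical_heads = np.flatnonzero(scores > 0.0)  # threshold of 0 is an assumption

acc_before = semantic_accuracy(head_contribs, lexical_dirs, semantic_dirs)
acc_after = semantic_accuracy(
    ablate_heads(head_contribs, lexical_heads), lexical_dirs, semantic_dirs
)
print("lexical heads:", lexical_heads.tolist())
print("accuracy before / after ablation:", acc_before, acc_after)
```

In this toy setting the intervention needs no gradient updates, which mirrors the training-free character described above. In a real CLIP encoder the per-head contributions would have to be extracted from the model, and the scoring would follow the paper's probabilistic analysis rather than this simple projection difference.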