Towards Robustness against Typographic Attack with Training-free Concept Localization

The authors propose a training-free method to mitigate typographic attacks on CLIP-based vision encoders, in which irrelevant text appearing in an image biases the visual representation toward the lexical meaning of that text. Using sampling-based interpretations and circuit mining, the approach isolates the specific Vision Transformer components that encode this unwanted lexical information.

The method quantitatively attributes semantic versus lexical focus to individual attention heads through probabilistic analysis.
Simple interventions on identified circuits improve robustness in object classification without additional training.
These interventions outperform both supervised and other training-free defense methods.
Applying the approach to state-of-the-art large vision-language models (LVLMs) yields substantial gains in Visual Question Answering accuracy on RIO-Bench under typographic attack.
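
The head-level attribution and intervention described above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes the image embedding can be approximated as a sum of per-head contributions, and it replaces the paper's probabilistic attribution with a simple dot-product score. The head names, the 3-dimensional vectors, and the helpers `lexical_heads` and `classify` are all hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy text embeddings (assumed): axis 0 ~ "dog", axis 1 ~ "cat".
TEXT_EMB = {"dog": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0]}

# Hypothetical per-head contributions for a photo of a dog with the
# word "cat" pasted on it. Real values would come from a ViT forward pass.
HEAD_OUT = {
    "L10.H3": [0.9, 0.1, 0.2],  # mostly visual semantics
    "L11.H0": [0.6, 0.0, 0.1],  # mostly visual semantics
    "L11.H7": [0.1, 1.8, 0.0],  # mostly reads the overlaid text
}

def classify(head_out, ablate=()):
    """Zero-shot label from the sum of all head contributions not ablated."""
    kept = [v for name, v in head_out.items() if name not in ablate]
    emb = [sum(col) for col in zip(*kept)]
    return max(TEXT_EMB, key=lambda label: cosine(emb, TEXT_EMB[label]))

def lexical_heads(head_out, true_label, attack_label, margin=0.0):
    """Flag heads that align more with the written word than with the object."""
    flagged = []
    for name, v in head_out.items():
        lexical = dot(v, TEXT_EMB[attack_label])
        semantic = dot(v, TEXT_EMB[true_label])
        if lexical - semantic > margin:
            flagged.append(name)
    return flagged

circuit = lexical_heads(HEAD_OUT, true_label="dog", attack_label="cat")
print("prediction before intervention:", classify(HEAD_OUT))
print("heads flagged as lexical:", circuit)
print("prediction after ablation:", classify(HEAD_OUT, ablate=circuit))
```

In this toy setting the overlaid word dominates the summed embedding until the flagged head is zeroed out, after which the prediction follows the depicted object. With a real encoder, the same intervention would typically be applied through forward hooks on the relevant attention layers, with no weight updates.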

This mechanistic approach provides an interpretable and generalizable way to harden safety-critical applications such as autonomous driving against text-induced biases.