A new framework using Sparse Autoencoders extracts and analyzes visual, textual, and multimodal concepts from Vision Language Models. Experiments on LLaVA-NeXT show up to 45% improvement in visual concept quality and systematic identification of multimodal concepts, offering a structured approach to understanding VLM internal representations.
Extraction and Analysis of Multimodal Concepts in Vision Language Models