Multimodal Chain-of-Thought reasoning improves performance on mathematical and scientific reasoning but degrades visual grounding and object counting in perception tasks. Models exhibit a "Look Light, Think Heavy" pattern: visual reflection diminishes as verbal reasoning increases, which points to a persistent bottleneck in visual introspection during multimodal reasoning.
Multimodal Chain-of-Thought: Capabilities and Limitations