Researchers propose StochasT, a method to address the discrepancy between multi-turn training and single-turn evaluation in Large Vision-Language Models (LVLMs). The approach stochastically groups language tasks for the same image into clusters of varying sizes while preserving their organic order.
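The grouping step can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_group`, the `max_group_size` parameter, and the choice of uniformly random contiguous group sizes are all assumptions made for illustration. The only properties taken from the description above are that group sizes vary, that the original task order is preserved, and that no task is discarded.

```python
import random


def stochastic_group(tasks, rng=None, max_group_size=None):
    """Split an ordered list of tasks into contiguous groups of random size.

    Hypothetical helper: the sampling distribution (uniform over the
    allowed sizes) is an assumption, not taken from the source.
    """
    if rng is None:
        rng = random.Random()
    n = len(tasks)
    limit = n if max_group_size is None else max_group_size
    groups = []
    start = 0
    while start < n:
        # Draw a group size between 1 and whatever still fits.
        size = rng.randint(1, min(limit, n - start))
        groups.append(tasks[start:start + size])
        start += size
    return groups


# Example: five language tasks attached to one image.
tasks = ["q1", "q2", "q3", "q4", "q5"]
groups = stochastic_group(tasks, rng=random.Random(0))
# Concatenating the groups in order recovers the original task list.
flattened = [t for g in groups for t in g]
```

Under these assumptions, a group of size 1 corresponds to a single-turn sample and a larger group to a multi-turn conversation, so a single pass over the data mixes both regimes.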
- StochasT borrows the randomness of Dropout and stochastic depth but discards no data: it only randomizes how tasks are grouped.
- A benchmark-agnostic evaluation mechanism based on the Balanced Latin Square measures robustness under varying contextual dependencies.
- Experiments indicate that LVLMs trained with StochasT perform well in both single-turn and multi-turn use cases.
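For readers unfamiliar with the Balanced Latin Square mentioned above, the following sketch shows the standard construction. How StochasT maps the rows onto evaluation prompts is not stated here, so the interpretation in the comments (each row as one ordering of `n` context items) is an assumption, and `balanced_latin_square` is a hypothetical helper name.

```python
def balanced_latin_square(n):
    """Return orderings of range(n) forming a balanced Latin square.

    Standard construction: the first row is 0, 1, n-1, 2, n-2, ... and
    every later row adds 1 (mod n). For odd n the mirrored rows are
    appended, giving 2n rows, so that immediate-successor pairs stay
    balanced.
    """
    first = []
    for j in range(n):
        if j == 0:
            first.append(0)
        elif j % 2 == 1:
            first.append((j + 1) // 2)
        else:
            first.append(n - j // 2)
    rows = [[(x + i) % n for x in first] for i in range(n)]
    if n % 2 == 1:
        rows += [list(reversed(r)) for r in rows]
    return rows


# Assumed usage: each row is one ordering of four context items.
square = balanced_latin_square(4)
# square == [[0, 1, 3, 2], [1, 2, 0, 3], [2, 3, 1, 0], [3, 0, 2, 1]]
```

In the even case each item appears once in every position and every ordered pair of distinct items appears exactly once as immediate neighbors, which is what makes the design useful for averaging out order and context effects.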
This approach helps close the gap between training conditions and test scenarios, allowing models to realize their full potential despite visual attention decay and contextual overfitting.