StochasT improves visual instruction tuning with stochastic turn depth

Researchers propose StochasT, a method to address the discrepancy between multi-turn training and single-turn evaluation in Large Vision-Language Models (LVLMs). The approach stochastically groups language tasks for the same image into clusters of varying sizes while preserving their organic order.
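
The grouping step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stochastic_turn_groups`, the `max_turns` cap, and the uniform sampling of group sizes are all assumptions made for illustration. The only properties taken from the description are that groups vary in size, that the original task order is preserved, and that no task is discarded.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def stochastic_turn_groups(
    tasks: Sequence[T], max_turns: int, rng: random.Random
) -> List[List[T]]:
    """Split one image's tasks into contiguous groups of random size.

    Hypothetical helper: sizes are drawn uniformly from 1..max_turns,
    which is an assumption, not a detail given in the summary.
    """
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    groups: List[List[T]] = []
    start = 0
    while start < len(tasks):
        size = rng.randint(1, max_turns)  # inclusive on both ends
        # Slicing keeps the original order and clips the last group,
        # so every task lands in exactly one group.
        groups.append(list(tasks[start:start + size]))
        start += size
    return groups
```

Each returned group would then serve as one training conversation for the image: a group of size 1 is a single-turn sample, and larger groups are multi-turn samples.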

Like Dropout and stochastic depth, StochasT injects randomness during training, but it does so by regrouping tasks rather than discarding any data.
A benchmark-agnostic evaluation mechanism based on the Balanced Latin Square measures robustness under varying contextual dependencies.
Experiments show that LVLMs trained with the method perform strongly in both single-turn and multi-turn use cases.
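
For readers unfamiliar with the Balanced Latin Square, the sketch below shows the standard construction. How StochasT maps the square onto evaluation tasks is not specified here, so the function name `balanced_latin_square` and the reading of each row as one ordering of n tasks are assumptions. In such a square every item appears once in each position, and every ordered pair of distinct items is adjacent equally often, which is what makes it useful for balancing order effects.

```python
from typing import List


def balanced_latin_square(n: int) -> List[List[int]]:
    """Return orderings of 0..n-1 that balance position and adjacency.

    Standard construction: the first row is 0, 1, n-1, 2, n-2, ...
    and each later row adds 1 (mod n). For odd n the reversed rows are
    appended as well, giving 2n rows instead of n.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    first = [0]
    lo, hi = 1, n - 1
    take_lo = True
    while lo <= hi:
        if take_lo:
            first.append(lo)
            lo += 1
        else:
            first.append(hi)
            hi -= 1
        take_lo = not take_lo
    rows = [[(x + shift) % n for x in first] for shift in range(n)]
    if n % 2 == 1:
        # An odd n cannot be balanced with n rows, so mirror each row.
        rows += [list(reversed(row)) for row in rows]
    return rows
```

For n = 4 the rows are [0, 1, 3, 2], [1, 2, 0, 3], [2, 3, 1, 0], and [3, 0, 2, 1]. Under the assumption above, evaluating a model on every row and comparing scores would show how sensitive it is to the order of preceding turns.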

This approach helps close the gap between training conditions and test scenarios, allowing models to realize their full potential despite visual attention decay and contextual overfitting.