AIR: Adaptive Interleaved Reasoning with Code in MLLMs

This paper introduces AIR, a method that gives multimodal large language models (MLLMs) adaptive interleaved reasoning capabilities through extended reinforcement learning (RL) training on code-augmented, complex numerical computation tasks. Prior work focuses mainly on tool use in vision-perception tasks and relies on predefined heuristics that cannot handle numerical computation. To address this gap, the authors propose a three-component solution: a two-stage cold-start data construction pipeline, data filtering strategies for curating the RL dataset, and an adaptive tool-invocation strategy built on a group-constrained reward function. Experiments show that RL training with this reward function improves performance by an average of 6.1 percentage points across the evaluation benchmarks. In particular, accuracy on interleaved reasoning samples increases by 9.9 percentage points, and the overall tool-use success rate exceeds 95 percent. The authors release their data and code in a public GitHub repository.
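To make the idea of a group-constrained reward more concrete, the following is a minimal illustrative sketch, not the authors' formulation, which this summary does not specify. It assumes a group-based RL setup in which several rollouts are sampled per prompt, and it assumes that a tool-use bonus is granted only when the tool-using rollouts in a group are more accurate than the text-only ones. The names `Rollout`, `group_constrained_rewards`, and `tool_bonus`, as well as the gating rule itself, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """One sampled response for a prompt (hypothetical structure)."""
    correct: bool    # final answer matched the reference
    used_tool: bool  # response invoked the code tool at least once


def _accuracy(rollouts: List[Rollout]) -> float:
    """Fraction of correct rollouts; 0.0 for an empty list."""
    if not rollouts:
        return 0.0
    return sum(r.correct for r in rollouts) / len(rollouts)


def group_constrained_rewards(group: List[Rollout],
                              tool_bonus: float = 0.5) -> List[float]:
    """Assumed reward: 1.0 for a correct answer, plus a tool bonus that is
    active only if tool-using rollouts beat text-only rollouts in this group."""
    with_tool = [r for r in group if r.used_tool]
    without_tool = [r for r in group if not r.used_tool]

    # Group-level gate (an assumption): reward tool use only where it helps.
    bonus_active = bool(with_tool) and _accuracy(with_tool) > _accuracy(without_tool)

    rewards = []
    for r in group:
        reward = 1.0 if r.correct else 0.0
        if bonus_active and r.used_tool and r.correct:
            reward += tool_bonus
        rewards.append(reward)
    return rewards
```

Under these assumptions, a group in which tool use does not improve accuracy yields plain correctness rewards, so the policy is not pushed toward invoking code when it brings no benefit.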