The authors propose PRP, a Proactive Routing Paradigm that accelerates inference in large multimodal models by enabling early decision-making through joint evaluation of draft and target model competence. This approach addresses the bottleneck of establishing reliable query difficulty signals in multimodal settings without relying on data-sensitive supervised fine-tuning or post-hoc token probabilities.
- PRP employs Draft Rating Learning (DRL) to equip the draft model with an internal confidence estimator.
- Joint Rating Learning (JRL) predicts how well the target model can handle a given query to prioritize samples it excels at.
- The method enables fine-grained, instance-level proactive routing that substantially accelerates inference without compromising overall performance.
- Extensive experiments across multiple multimodal reasoning benchmarks validate the effectiveness and efficiency of the proposed paradigm.
This strategy enables cooperative inference between a small draft model and a large target model. Queries are routed adaptively according to their estimated difficulty, so the routing decision is made up front rather than after a complete output has been generated, balancing efficiency and accuracy.
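The routing idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that DRL and JRL each yield a scalar score in [0, 1] per query, and that a simple threshold-plus-margin rule combines them. The names `Ratings`, `route`, `draft_threshold`, and `margin`, and their default values, are hypothetical and do not come from the summary above.

```python
from dataclasses import dataclass


@dataclass
class Ratings:
    # Hypothetical: draft model's self-rated confidence for the query (DRL-style score).
    draft_score: float
    # Hypothetical: predicted competence of the target model on the query (JRL-style score).
    target_score: float


def route(r: Ratings, draft_threshold: float = 0.7, margin: float = 0.1) -> str:
    """Pick a model for a query before any answer is generated.

    Assumed decision rule (illustrative only):
    - keep the query on the draft model when it is confident enough;
    - otherwise escalate to the target model only if the target is
      predicted to do better by at least `margin`;
    - otherwise stay on the cheaper draft model.
    """
    if r.draft_score >= draft_threshold:
        return "draft"
    if r.target_score - r.draft_score >= margin:
        return "target"
    return "draft"


if __name__ == "__main__":
    queries = {
        "easy": Ratings(draft_score=0.9, target_score=0.95),
        "hard, target helps": Ratings(draft_score=0.4, target_score=0.9),
        "hard, target does not help": Ratings(draft_score=0.3, target_score=0.32),
    }
    for name, ratings in queries.items():
        print(f"{name}: {route(ratings)}")
```

Under this assumed rule, the large model is invoked only for queries where the draft model is unsure and the target model is expected to add value, which is one way to read "prioritize samples it excels at".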