Auditing Order Sensitivity in Multimodal Large Language Models

The study introduces Facet-Probe, a five-facet audit of order sensitivity in 18 frontier and open-weight multimodal large language models (MLLMs). Standard benchmarks rarely test whether shuffling the evidence changes a model's answer, a reliability property highlighted by emerging AI evaluation guidelines. Using a Bayesian item-response model, the researchers separated ordering noise from per-facet bias and estimated decoder-stochastic floors via same-ordering controls. The audit found that none of the 18 models is order-invariant: panel-mean flip rates range from 24% to 50% across facets. Even the best-performing model flipped its answer on 13.4% of trials, indicating that higher capability does not eliminate this vulnerability. Training-free prompt-level mitigations proved modality-conditional and did not transfer between text and visual reasoning tasks. These findings suggest that prompt-level fixes are insufficient for general order robustness and motivate architectural solutions. The authors propose the cross-ordering flip rate as a standard reporting axis for future MLLM evaluations.
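The relationship between a cross-ordering flip rate and a same-ordering control can be illustrated with a minimal sketch. It assumes that a "flip" means two answers to the same item disagree, and it subtracts the same-ordering rate as a crude stand-in for the decoder-stochastic floor. The study's exact definitions and its Bayesian item-response model are not reproduced here, and the function names and toy answers are hypothetical.

```python
from itertools import combinations


def pairwise_flip_rate(answers_by_item):
    """Fraction of within-item answer pairs that disagree.

    answers_by_item: one list of answers per item, where each answer
    comes from one run of the model (assumed definition of a "flip").
    """
    flips = 0
    pairs = 0
    for answers in answers_by_item:
        for first, second in combinations(answers, 2):
            pairs += 1
            flips += first != second
    return flips / pairs if pairs else 0.0


def excess_flip_rate(cross_ordering, same_ordering):
    """Cross-ordering flip rate minus the same-ordering rate, floored at 0.

    This subtraction is an illustrative simplification, not the
    item-response model described in the study.
    """
    cross = pairwise_flip_rate(cross_ordering)
    floor = pairwise_flip_rate(same_ordering)
    return max(0.0, cross - floor)


# Hypothetical toy data: two items, four runs each.
# Each run in cross_ordering shows the evidence in a different order.
cross_ordering = [["A", "B", "B", "A"], ["C", "C", "C", "C"]]
# Each run in same_ordering repeats one fixed order (sampling noise only).
same_ordering = [["A", "A", "A", "B"], ["C", "C", "C", "C"]]

print(f"cross-ordering flip rate: {pairwise_flip_rate(cross_ordering):.3f}")
print(f"same-ordering floor:      {pairwise_flip_rate(same_ordering):.3f}")
print(f"excess over floor:        {excess_flip_rate(cross_ordering, same_ordering):.3f}")
```

Pooling pairs across items weights every comparison equally. Averaging per-item rates instead would be an equally plausible choice when items have different numbers of runs.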