When Top-1 Fails: Calibrating LoRA Monitors for Masked Diffusion LMs

This study evaluates top-1 argmax concentration as a collapse warning when fine-tuning discrete diffusion language models (DLMs) with Low-Rank Adaptation (LoRA). The authors find that the metric has zero precision: it saturates before optimization begins, so it cannot distinguish runs that collapse from runs that do not.

Across 816 LoRA/PEFT configurations spanning three DLM families, the warning fired in every case, while the logs recorded zero actual collapses at the 200-step horizon.
The failure is attributed to pre-equilibrium saturation: top-1 concentration is already high before optimization starts, so it is insensitive to final training stability.
On a held-out LLaDA-family split, the maximum LoRA gradient norm identified top-decile final-loss configurations with precision 0.68 and F1 0.79.
Autoregressive controls and cross-family threshold failures restrict the result to short-horizon DLM-LoRA inspection; the max-gradient monitor is not a universal collapse detector.
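
The arithmetic behind these figures can be checked with a short sketch. The helper names below are hypothetical, and the recall value is not reported in the summary: it is only what the rounded precision and F1 imply, assuming the standard definition F1 = 2PR / (P + R).

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of fired alarms that were real events (0.0 if nothing fired)."""
    fired = true_pos + false_pos
    return true_pos / fired if fired else 0.0


def recall_from_precision_f1(p: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denom = 2 * p - f1
    if denom <= 0:
        raise ValueError("no valid recall for this precision/F1 pair")
    return f1 * p / denom


# Top-1 alarm: fired on all 816 configurations, zero logged collapses,
# so every alarm is a false positive.
top1_precision = precision(true_pos=0, false_pos=816)

# Max-gradient monitor: recall implied by the rounded P = 0.68, F1 = 0.79.
# This is a derived estimate, not a number taken from the study.
implied_recall = recall_from_precision_f1(0.68, 0.79)

print(f"top-1 precision: {top1_precision:.2f}")
print(f"implied max-gradient recall: {implied_recall:.2f}")
```

Because the inputs are rounded to two decimals, the implied recall should be read as approximate.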

The authors recommend dropping top-1 as a PEFT alarm and instead logging max-gradient early in training, with thresholds calibrated per DLM family for effective inspection.
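
One way per-family calibration could look is sketched below. This is an illustration under stated assumptions, not the authors' procedure: it assumes "top-decile final loss" means the highest 10% of final losses, that one early max-gradient value is logged per configuration, and that the threshold is chosen to maximize F1 on a calibration split. All function names and the toy data are hypothetical.

```python
from typing import Sequence


def top_decile_labels(final_losses: Sequence[float]) -> list[bool]:
    """Mark configurations whose final loss falls in the highest 10% (assumed meaning)."""
    k = max(1, len(final_losses) // 10)
    cutoff = sorted(final_losses, reverse=True)[k - 1]
    return [loss >= cutoff for loss in final_losses]


def f1_at(threshold: float, scores: Sequence[float], labels: Sequence[bool]) -> float:
    """F1 of the rule 'flag a configuration if its score >= threshold'."""
    tp = sum(s >= threshold and y for s, y in zip(scores, labels))
    fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s < threshold and y for s, y in zip(scores, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def calibrate_threshold(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Pick the observed score that maximizes F1; ties go to the lowest candidate."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))


# Toy calibration split for a single family (made-up numbers, not study data).
early_max_grad = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0]
final_loss = [1.0] * 9 + [3.0]

labels = top_decile_labels(final_loss)
threshold = calibrate_threshold(early_max_grad, labels)
print(f"calibrated threshold: {threshold}")
```

Running the calibration separately on each family's own split yields one threshold per family, in line with the reported cross-family threshold failures.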