NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation
Researchers propose a training-free method for choosing which layers of a hybrid attention model should retain full attention, rather than relying on a fixed layer pattern that can be inefficient for long-context inference. The method measures how much the negative log-likelihood (NLL) of answer tokens degrades under sliding-window attention and uses that degradation to identify the layers that are critical for maintaining accuracy.
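
The summary does not spell out the selection procedure, so the following is only a minimal sketch of one way NLL-guided selection could work, not the authors' implementation. It assumes a leave-one-out scheme: each layer is switched to sliding-window attention on its own, the increase in answer-token NLL over the all-full-attention baseline is recorded, and the layers with the largest increase keep full attention. The names `mean_answer_nll`, `select_full_attention_layers`, `score_nll`, `budget`, and `toy_score` are hypothetical, and the toy scorer stands in for a real model evaluation.

```python
from typing import Callable, FrozenSet, List, Sequence


def mean_answer_nll(token_logprobs: Sequence[float], answer_mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood over answer tokens only (mask is True on answer tokens)."""
    picked = [-lp for lp, is_answer in zip(token_logprobs, answer_mask) if is_answer]
    if not picked:
        raise ValueError("answer_mask selects no tokens")
    return sum(picked) / len(picked)


def select_full_attention_layers(
    num_layers: int,
    budget: int,
    score_nll: Callable[[FrozenSet[int]], float],
) -> List[int]:
    """Pick `budget` layers to keep on full attention.

    `score_nll(full_layers)` is a hypothetical callback that returns the mean
    answer-token NLL when only `full_layers` use full attention and every
    other layer uses sliding-window attention.
    """
    all_layers = frozenset(range(num_layers))
    baseline = score_nll(all_layers)  # every layer on full attention

    # Assumed leave-one-out scoring: NLL increase when one layer alone is windowed.
    degradation = {
        layer: score_nll(all_layers - {layer}) - baseline
        for layer in range(num_layers)
    }

    # Layers whose removal hurts most are treated as the critical ones.
    ranked = sorted(range(num_layers), key=lambda layer: degradation[layer], reverse=True)
    return sorted(ranked[:budget])


def toy_score(full_layers: FrozenSet[int]) -> float:
    """Toy stand-in for a model evaluation: a fixed NLL penalty per windowed layer."""
    penalty = {0: 0.05, 1: 0.90, 2: 0.10, 3: 0.60}
    return 1.0 + sum(p for layer, p in penalty.items() if layer not in full_layers)


print(select_full_attention_layers(num_layers=4, budget=2, score_nll=toy_score))  # → [1, 3]
```

In a real setting, `score_nll` would run the model on a calibration set with the chosen attention pattern and reduce per-token log-probabilities with something like `mean_answer_nll`. Leave-one-out scoring ignores interactions between layers, so a greedy or iterative variant may behave differently.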