Researchers propose a training-free method to select which layers in hybrid attention models should retain full attention, addressing the inefficiency of fixed patterns in long-context inference. By measuring negative log-likelihood degradation on answer tokens, the approach identifies layers critical for maintaining accuracy when switching to sliding-window attention.
- The method scores each layer by how much the negative log-likelihood of the answer tokens rises when that layer uses sliding-window instead of full attention, and keeps full attention in the layers where the rise is largest.
- On LongMemEval with Qwen3-4B, it achieves 64.6% accuracy with full attention in only 1/4 of the layers, nearly matching the 65.0% accuracy of a periodic 1/2-FA baseline while using half as many full-attention layers.
- It outperforms SWAA-reported periodic 1/4-FA baselines by 10.4 percentage points and LightTransfer-style baselines by 26.4 percentage points.
- De-confounding analysis confirms the selection signal aligns with long-range attention needs rather than generic layer sensitivity.
- The calibration process requires approximately 15 minutes of one-time computation.
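
The selection step described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the single-layer ablation against an all-full-attention reference, and the toy numbers are all assumptions made for the example. It takes precomputed answer-token NLL values, ranks layers by how much NLL rises when each one is switched to sliding-window attention, and keeps full attention in the top-ranked layers; a fixed periodic pattern is included for comparison.

```python
from typing import Dict, List


def select_full_attention_layers(
    nll_all_full: float,
    nll_with_layer_windowed: Dict[int, float],
    num_full_layers: int,
) -> List[int]:
    """Return the layer indices that should keep full attention.

    Assumption for this sketch: `nll_with_layer_windowed[i]` is the mean
    answer-token NLL on a calibration set when only layer i is switched to
    sliding-window attention, and `nll_all_full` is the same metric with
    every layer on full attention.
    """
    # Degradation = how much NLL rises when a single layer is windowed.
    degradation = {
        layer: nll - nll_all_full
        for layer, nll in nll_with_layer_windowed.items()
    }
    # Rank from most to least harmful to window; break ties by layer index.
    ranked = sorted(degradation, key=lambda layer: (-degradation[layer], layer))
    return sorted(ranked[:num_full_layers])


def periodic_layers(num_layers: int, period: int) -> List[int]:
    """Fixed-pattern baseline: every `period`-th layer keeps full attention."""
    return list(range(0, num_layers, period))


# Made-up calibration numbers for an 8-layer toy model (not from the source).
toy_nll = {0: 1.02, 1: 1.00, 2: 1.35, 3: 1.01, 4: 1.00, 5: 1.60, 6: 1.03, 7: 1.00}
selected = select_full_attention_layers(1.00, toy_nll, num_full_layers=2)
baseline = periodic_layers(num_layers=8, period=4)
print("selected:", selected)
print("periodic:", baseline)
```

Both configurations in the toy example use a 1/4 full-attention budget (2 of 8 layers); the difference is that the periodic pattern places them at fixed intervals, whereas the ranked selection places them where the calibration signal is strongest.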
This approach advances the efficiency-accuracy Pareto frontier for long-context LLM deployment by enabling significant computational savings without requiring model retraining.