arXiv cs.CL · 8d ago · research

ConSA: Learnable Sparsity Control in Hybrid Attention


ConSA is a framework that learns how to allocate full attention (FA) and sliding-window attention (SWA) within a hybrid-attention model, using L0 regularization under augmented Lagrangian constraints. It outperforms rule-based allocation methods. The learned layouts place SWA in the bottom layers and concentrate FA in middle-layer blocks, a pattern that holds across model scales and sparsity levels.
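
As a rough illustration of how L0 regularization and an augmented Lagrangian constraint can combine, the sketch below uses hard-concrete gates, a common relaxation in the L0-regularization literature. It is not ConSA's implementation: the one-gate-per-layer setup, the convention that an open gate means FA and a closed gate means SWA, and every function name, logit, and hyperparameter here are assumptions made for illustration.

```python
import math

# Hard-concrete stretch parameters; common defaults in the L0 literature,
# assumed here rather than taken from the paper.
BETA, GAMMA, ZETA = 2.0 / 3.0, -0.1, 1.1


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def prob_open(log_alpha):
    """Probability that a hard-concrete gate is non-zero (expected L0 term)."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def deterministic_gate(log_alpha):
    """Inference-time gate value: stretched sigmoid clipped to [0, 1]."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


def expected_fa_fraction(log_alphas):
    """Expected share of layers whose gate is open (assumed to mean FA)."""
    return sum(prob_open(a) for a in log_alphas) / len(log_alphas)


def augmented_lagrangian(log_alphas, target, lam, mu):
    """Penalty added to the task loss: lam * gap + (mu / 2) * gap**2."""
    gap = expected_fa_fraction(log_alphas) - target
    return lam * gap + 0.5 * mu * gap * gap


def dual_ascent_step(log_alphas, target, lam, mu):
    """Multiplier update: grows while the FA budget is exceeded."""
    return lam + mu * (expected_fa_fraction(log_alphas) - target)


if __name__ == "__main__":
    # Hypothetical per-layer gate logits for a 6-layer model.
    logits = [-4.0, -4.0, 3.0, 3.0, -4.0, -4.0]
    print("expected FA fraction:", expected_fa_fraction(logits))
    print("penalty:", augmented_lagrangian(logits, target=0.25, lam=0.1, mu=1.0))
    print("layer types:", ["FA" if deterministic_gate(a) > 0.5 else "SWA" for a in logits])
```

In a real training loop the gate logits would be updated by gradient descent on the task loss plus this penalty, while the multiplier is updated by the ascent step, so the expected FA fraction is pushed toward the target budget.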

