HydraHead introduces a head-level hybridization of Full and Linear Attention, leveraging interpretability to select retrieval-critical heads and fuse outputs via a scale-normalized module. Trained on 15B tokens, it achieves over 69% improvement over baseline at 512K context length, outperforming layer-wise hybrids and approaching Qwen3.5's performance on long-context tasks.
arxiv
arXiv cs.CL
·
6d ago
·
research
HydraHead: Head-Level Hybrid Attention for Long-Context Performance
from English
Importance 3/3
Beats a top-lab benchmark
New feature vs. leaders
arXiv cs.CL
Alibaba (Qwen)
Evaluation & benchmarks
Reasoning models
Training methods
Benchmarks
| Benchmark | Model | Score |
|---|---|---|
| GAIA | HydraHead | — |
| LMSYS Arena (Elo) | HydraHead | — |
| SWE-bench Verified | HydraHead | — |
| WebArena | HydraHead | — |