arXiv cs.CL · 6d ago · research

HydraHead: Head-Level Hybrid Attention for Long-Context Performance

HydraHead hybridizes Full and Linear Attention at the level of individual heads, using interpretability analysis to select retrieval-critical heads and fusing the head outputs through a scale-normalized module. Trained on 15B tokens, it improves on the baseline by more than 69% at a 512K context length, outperforms layer-wise hybrids, and approaches Qwen3.5's performance on long-context tasks.
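To make the idea of head-level hybridization concrete, here is a minimal NumPy sketch. It is not the paper's implementation, and several parts are assumptions: the function names, the elu(x)+1 feature map for the linear heads, the hand-picked `full_head_ids` set (the summary says heads are chosen via interpretability, but not how), and per-head RMS normalization as a stand-in for the "scale-normalized" fusion module.

```python
import numpy as np

def full_attention(q, k, v):
    """Causal softmax attention for one head; q, k, v have shape (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # positions j > i
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v, eps=1e-6):
    """Causal linear attention for one head, computed as a recurrence.

    Assumption: the elu(x) + 1 feature map, a common choice; the summary
    does not say which linear attention variant HydraHead uses.
    """
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

    q, k = phi(q), phi(k)
    T, d = q.shape
    state = np.zeros((d, v.shape[1]))  # running sum of outer(k_t, v_t)
    norm = np.zeros(d)                 # running sum of k_t
    out = np.empty_like(v)
    for t in range(T):
        state += np.outer(k[t], v[t])
        norm += k[t]
        out[t] = (q[t] @ state) / (q[t] @ norm + eps)
    return out

def rms_normalize(x, eps=1e-6):
    """Rescale each row to unit root-mean-square (assumed fusion step)."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def hybrid_heads(q, k, v, full_head_ids):
    """Mix attention types per head; q, k, v have shape (H, T, d).

    Heads listed in full_head_ids (hypothetical selection) use softmax
    attention; all others use linear attention. Each head's output is
    normalized to a common scale before the heads are concatenated.
    """
    outputs = []
    for h in range(q.shape[0]):
        attend = full_attention if h in full_head_ids else linear_attention
        outputs.append(rms_normalize(attend(q[h], k[h], v[h])))
    return np.concatenate(outputs, axis=-1)  # shape (T, H * d)
```

Normalizing each head before concatenation is one simple way to keep softmax and linear heads on a comparable scale; the actual fusion module in the paper may differ.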

Importance 3/3 · Beats a top-lab benchmark · New feature vs. leaders · arXiv cs.CL · Alibaba (Qwen) · Evaluation & benchmarks · Reasoning models · Training methods
