CAHP is a post-hoc framework that uses graph-theoretic clustering and information-theoretic measures to select complementary attention heads in Transformers. It determines how many heads to retain automatically, without a predefined sparsity target, by identifying a performance degradation threshold that keeps the loss in model performance minimal. It outperforms baselines in high-compression scenarios by preserving functionally critical heads in intermediate layers.
CAHP: Complementary Attention Head Pruning for Efficient Transformers