arXiv cs.LG · 6d ago · research

StreamKL: Fast and Memory-Efficient KL Divergence for Attention Distillation

StreamKL introduces a fused GPU primitive that removes the quadratic memory cost of attention distillation by streaming query-key tiles through on-chip SRAM. It achieves up to 43x speedup in the forward pass and 14x in the backward pass, and cuts the extra HBM footprint from O(N_Q N_K) to O(1), which enables long-context distillation on a single GPU.
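The streaming idea can be illustrated on the CPU with a minimal sketch. It is not the paper's kernel: it assumes the loss is KL(teacher || student) over one query's attention row, takes the logits as given (a fused kernel would compute each tile's logits from query and key blocks on the fly), and uses the standard online log-sum-exp trick. The names `kl_reference` and `kl_streaming` are hypothetical.

```python
import math

def kl_reference(t, s):
    """Materialize both full softmax rows, then sum p * (log p - log q)."""
    def log_softmax(x):
        m = max(x)
        lse = m + math.log(sum(math.exp(v - m) for v in x))
        return [v - lse for v in x]
    lp, lq = log_softmax(t), log_softmax(s)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def kl_streaming(t, s, tile=4):
    """One pass over key tiles, keeping only five running scalars.

    t, s: teacher and student logits for a single query row.
    Uses KL = E_p[t - s] - logsumexp(t) + logsumexp(s), p = softmax(t).
    """
    m_t = m_s = -math.inf   # running maxima of teacher / student logits
    l_t = l_s = 0.0         # running sums of exp(logit - running max)
    acc = 0.0               # running sum of exp(t - m_t) * (t - s)
    for start in range(0, len(t), tile):
        t_tile = t[start:start + tile]
        s_tile = s[start:start + tile]
        new_m_t = max(m_t, max(t_tile))
        new_m_s = max(m_s, max(s_tile))
        # Rescale earlier partial sums to the new maxima.
        # On the first tile exp(-inf) is 0.0, so nothing carries over.
        scale_t = math.exp(m_t - new_m_t)
        scale_s = math.exp(m_s - new_m_s)
        l_t *= scale_t
        acc *= scale_t
        l_s *= scale_s
        for tv, sv in zip(t_tile, s_tile):
            w = math.exp(tv - new_m_t)
            l_t += w
            acc += w * (tv - sv)
            l_s += math.exp(sv - new_m_s)
        m_t, m_s = new_m_t, new_m_s
    return acc / l_t - (m_t + math.log(l_t)) + (m_s + math.log(l_s))
```

Calling `kl_streaming(t, s, tile=3)` on two equal-length lists of floats should agree with `kl_reference(t, s)` up to floating-point rounding, while never holding more than one tile of logits at a time. The state per query row is constant in the number of keys, which is the property the O(1) extra-memory claim refers to.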

Importance 3/3 · New feature vs. leaders · New harness with differentiators · arXiv cs.LG · NVIDIA · Evaluation & benchmarks · Inference efficiency · Training methods
