The llama.cpp b9857 release reworks the Hexagon Flash Attention implementation, with a focus on performance and numerical accuracy. It changes both the hex-mm and hex-fa modules: quant tasks are folded into the main matmul threads, matmul is fused with ADD operations, and mask processing is optimized.
- Hexagon Flash Attention (hex-fa) optimizations include factorizing ukernels, moving kernel-parameter computation to the host, and adding support for FA_SELECT and Sinks.
- Performance enhancements include updating HVX fallback thresholds to recover from throughput regressions, optimizing mask DMA caching, and using aligned loads and uint32_t indices.
- Numerical precision improvements include keeping softmax accumulators in fp32, replacing vec_exp_f32 with vec_exp2_f16, and avoiding conversion overflows by not using -inf for mask initialization.
- The release provides binaries for macOS (Apple Silicon and Intel), Linux (CPU, Vulkan, ROCm, OpenVINO, SYCL), Android, Windows (CPU, CUDA 12/13, Vulkan, OpenCL, HIP, OpenVINO, SYCL), and openEuler.
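
The numerical precision items above can be illustrated with a small sketch. It is not the Hexagon kernel code: it is a plain Python model of the streaming softmax denominator used by flash-attention-style kernels, with accumulator rounding emulated through the standard `struct` module. The names `to_fp16`, `to_fp32`, `online_softmax_denominator`, and `MASK_VALUE` are hypothetical, and the finite mask value of -30000.0 is an assumption rather than the constant the release uses. The NaN shown here is one well-known hazard of `-inf` masks and may differ from the specific conversion overflow the release addresses.

```python
import math
import struct

# Hypothetical large finite stand-in for "masked out" (assumption, not the
# value used by llama.cpp). It still fits in fp16, whose minimum is -65504.
MASK_VALUE = -30000.0


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def online_softmax_denominator(score_blocks, init_max, round_acc):
    """Stream over blocks of scores, keeping a running max and running sum.

    round_acc models the accumulator precision (to_fp16 or to_fp32).
    """
    running_max = init_max
    running_sum = 0.0
    for block in score_blocks:
        new_max = max(running_max, max(block))
        # Rescale the previous sum to the new max before adding new terms.
        scale = math.exp(running_max - new_max)
        running_sum = round_acc(running_sum * scale)
        for s in block:
            running_sum = round_acc(running_sum + math.exp(s - new_max))
        running_max = new_max
    return running_sum


# 4096 equal scores: every exp term is 1.0, so the exact sum is 4096.
equal_scores = [[0.0] * 4096]
sum_fp32 = online_softmax_denominator(equal_scores, MASK_VALUE, to_fp32)  # 4096.0
sum_fp16 = online_softmax_denominator(equal_scores, MASK_VALUE, to_fp16)  # 2048.0

# A fully masked block followed by a block of real scores.
NEG_INF = float("-inf")
with_inf = online_softmax_denominator(
    [[NEG_INF, NEG_INF], [0.0, 0.0]], NEG_INF, to_fp32
)  # nan, because -inf - (-inf) is nan
with_finite = online_softmax_denominator(
    [[MASK_VALUE, MASK_VALUE], [0.0, 0.0]], MASK_VALUE, to_fp32
)  # 2.0, the masked terms are rescaled away
```

In this model the fp16 accumulator stalls at 2048 because adding 1.0 to 2048 is no longer representable in half precision, while the fp32 accumulator reaches the exact sum. The finite mask keeps every subtraction well defined, so a fully masked block contributes nothing once a real score arrives.
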
This update improves inference performance on Hexagon DSPs, and its prebuilt binaries cover a wide range of platforms and accelerators for llama.cpp users.