The llama.cpp b9828 release reworks the OpenCL backend's Flash Attention (FA) kernels for f16 and f32 precision. It also adds prefill prepass kernels and FA support for the q4_0 and q8_0 quantization formats.
- Reworked FA kernel for f16 and f32 with optimized tile padding and masking logic.
- Added FA kernels for q4_0 and q8_0 quantization, along with dequantization kernels and support for SoA (structure-of-arrays) tensors.
- Introduced an FA tile tuning table with override capabilities, and fixed infinity handling when kernels are built with -cl-finite-math-only.
- Provided pre-built binaries for macOS (Apple Silicon/Intel), Linux (CPU/Vulkan/ROCm/OpenVINO/SYCL), Windows (CPU/CUDA/Vulkan/HIP/OpenVINO/SYCL), Android, and openEuler.
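The infinity-handling item in the list above can be illustrated with a small sketch. With -cl-finite-math-only, the OpenCL compiler may assume that no value is infinite or NaN, so a kernel that masks scores with negative infinity can misbehave. One common workaround, shown here in Python, is to use a large finite sentinel instead. This is an assumption about the general technique, not the actual change made in this release; `MASK_VALUE` and `online_softmax_weighted_sum` are hypothetical names.

```python
import math

# Hypothetical finite sentinel standing in for -infinity; it fits in float32.
MASK_VALUE = -3.0e38


def online_softmax_weighted_sum(scores, values, mask):
    """Streaming softmax(scores) weighted sum of values for one attention row.

    Masked positions contribute nothing. No infinities are used, so the
    rescaling step never computes (-inf) - (-inf), which would give NaN.
    """
    running_max = MASK_VALUE
    denom = 0.0
    acc = 0.0
    for s, v, keep in zip(scores, values, mask):
        s = s if keep else MASK_VALUE
        new_max = max(running_max, s)
        # Rescale what has been accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        p = math.exp(s - new_max) if keep else 0.0
        denom = denom * scale + p
        acc = acc * scale + p * v
        running_max = new_max
    # A fully masked row has denom == 0; return 0 rather than dividing.
    return acc / denom if denom > 0.0 else 0.0
```

Calling `online_softmax_weighted_sum([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [True, False, True])` gives the softmax-weighted average over the first and third entries only, and a fully masked row returns 0.0.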
Together, these changes aim to make inference on OpenCL-capable hardware more efficient, through the reworked tiling and masking in the Flash Attention kernels and through Flash Attention support for the q4_0 and q8_0 quantization types.
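As a rough illustration of what the q4_0 and q8_0 formats store, the sketch below decodes one block of each in Python. It assumes ggml's reference block layout: 32 values per block, a little-endian fp16 scale followed by the quantized values, with q4_0 packing two 4-bit values per byte. The function names are hypothetical, and the OpenCL kernels in this release may read a repacked SoA layout rather than these interleaved blocks.

```python
import struct

QK = 32  # values per block in both formats (assumed from ggml's reference layout)


def dequantize_q8_0(block: bytes) -> list[float]:
    """Decode one 34-byte q8_0 block: fp16 scale, then 32 signed 8-bit values."""
    assert len(block) == 2 + QK
    (d,) = struct.unpack_from("<e", block, 0)
    qs = struct.unpack_from(f"<{QK}b", block, 2)
    return [d * q for q in qs]


def dequantize_q4_0(block: bytes) -> list[float]:
    """Decode one 18-byte q4_0 block: fp16 scale, then 16 bytes of packed nibbles."""
    assert len(block) == 2 + QK // 2
    (d,) = struct.unpack_from("<e", block, 0)
    out = [0.0] * QK
    for j, byte in enumerate(block[2:]):
        # Stored nibbles are unsigned 0..15; subtracting 8 recentres them to -8..7.
        out[j] = ((byte & 0x0F) - 8) * d           # low nibble -> value j
        out[j + QK // 2] = ((byte >> 4) - 8) * d   # high nibble -> value j + 16
    return out
```

Both functions return 32 floats per block, so a full row is decoded by iterating over its consecutive blocks.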