The llama.cpp project has released version b9851. The release contains a CUDA fix that prevents integer truncation and overflow in the flash_attn_mask_to_KV_max kernel, specifically in how the kernel handles KQ mask strides.
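To make the class of bug concrete, here is a minimal host-side C++ sketch of how narrowing a stride to 32 bits can corrupt an offset. It is an illustration only and is not taken from the llama.cpp source: the helper names `offset_narrow` and `offset_wide`, the parameter names, and the example values are all hypothetical, and the real kernel may differ in types and layout.

```cpp
#include <cstdint>

// Hypothetical helper (not llama.cpp code): computes a byte offset after
// narrowing both operands to 32 bits. The cast drops high bits of the stride
// (truncation), and the 32-bit unsigned multiply wraps modulo 2^32 (overflow).
// Assumes a platform where int is 32 bits, so no wider promotion occurs.
static int64_t offset_narrow(int64_t row, int64_t stride_bytes) {
    const uint32_t r32 = static_cast<uint32_t>(row);
    const uint32_t s32 = static_cast<uint32_t>(stride_bytes);
    const uint32_t wrapped = r32 * s32;  // well-defined wraparound for unsigned
    return static_cast<int64_t>(wrapped);
}

// Hypothetical helper (not llama.cpp code): keeps the whole computation in
// 64 bits, so the product is exact for any realistic tensor size.
static int64_t offset_wide(int64_t row, int64_t stride_bytes) {
    return row * stride_bytes;
}
```

With a row index of 70000 and a stride of 65536 bytes, `offset_wide` returns 4587520000 while `offset_narrow` returns 292552704, so the narrowed version would point at the wrong part of the mask. Small inputs give identical results, which is why this kind of bug tends to appear only with large masks or long contexts.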
- macOS binaries are available for Apple Silicon (arm64), with KleidiAI support disabled.
- Linux builds cover Ubuntu x64 and arm64 for CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32/FP16.
- Android arm64 (CPU) binaries are provided for mobile devices.
- Windows releases include CPU, OpenCL Adreno, CUDA 12/13, Vulkan, OpenVINO, SYCL, and HIP variants.
- openEuler builds for x86 and aarch64 architectures are listed, with some configurations disabled.
- A standalone UI binary is also included in the release assets.
This release improves correctness for CUDA users by fixing stride calculation errors in the affected kernel, and it provides pre-built binaries across major operating systems and hardware accelerators.