llama.cpp release b9851 includes a CUDA fix that prevents integer truncation and overflow in the flash_attn_mask_to_KV_max kernel. The fix concerns how that kernel handles KQ mask strides.
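
To show why stride arithmetic is prone to this class of bug, here is a minimal sketch that is not taken from the llama.cpp source. The helper names and the row and stride values are hypothetical. It assumes an offset is computed as row times stride, and compares what a 32-bit signed integer would hold against the 64-bit result.

```python
import ctypes


def offset_int32(row: int, stride: int) -> int:
    """Hypothetical helper: the product as a 32-bit signed integer would store it."""
    # ctypes integer types do no overflow checking, so the value wraps modulo 2**32.
    return ctypes.c_int32(row * stride).value


def offset_int64(row: int, stride: int) -> int:
    """Hypothetical helper: the same product held in a 64-bit signed integer."""
    return ctypes.c_int64(row * stride).value


# Illustrative values only; they are not taken from the release.
row, stride = 50_000, 50_000

print(offset_int32(row, stride))  # -1794967296 (wrapped, unusable as an offset)
print(offset_int64(row, stride))  # 2500000000 (the intended offset)
```

Small products agree in both widths, which is why such a bug can stay hidden until masks become large enough for the product to exceed the 32-bit range.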

  • macOS Apple Silicon (arm64) binaries are available, with KleidiAI support disabled.
  • Linux builds cover Ubuntu x64 and arm64 for CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL FP32/FP16.
  • Android arm64 (CPU) binaries are provided for mobile devices.
  • Windows releases include CPU, OpenCL Adreno, CUDA 12/13, Vulkan, OpenVINO, SYCL, and HIP variants.
  • openEuler builds for x86 and aarch64 architectures are listed, with some configurations disabled.
  • A standalone UI binary is also included in the release assets.

For CUDA users, the release corrects integer handling of KQ mask strides in this one kernel. It also ships the pre-built binaries listed above for macOS, Linux, Android, Windows, and openEuler, covering CPU builds and several hardware accelerator backends.