The llama.cpp project has released version b9862, featuring a performance optimization for the gated_delta_net operation and providing pre-built binaries for macOS, Linux, Windows, Android, and openEuler.

  • Removes redundant CUDA copies after gated_delta_net by detecting the gated_delta_net -> view -> cpy pattern.
  • Allows the CUDA gated_delta_net (GDN) kernel to write state snapshots directly into the recurrent cache, skipping intermediate tail writes.
  • Disables KleidiAI support for macOS Apple Silicon in this release.
  • Provides binaries for Ubuntu x64/arm64/s390x with CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends.
  • Includes Windows builds for CPU, OpenCL Adreno, CUDA 12/13, Vulkan, OpenVINO, SYCL, and HIP.
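
The pattern detection described in the first bullet can be illustrated with a toy graph scan. This is a minimal sketch and not llama.cpp's implementation: the `Node` class and the `find_gdn_view_cpy` function are hypothetical names invented for illustration, and the real ggml graph is a C structure with more conditions to check before a copy can safely be elided. The sketch assumes nodes are stored in execution order and only looks for three consecutive nodes where each one consumes the previous one.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps identity-based comparison between nodes
class Node:
    """Hypothetical stand-in for a compute-graph node (not a ggml type)."""
    op: str
    srcs: list = field(default_factory=list)


def find_gdn_view_cpy(nodes):
    """Return each index i where nodes[i:i+3] is gated_delta_net -> view -> cpy.

    A match requires that the view reads the gated_delta_net output and that
    the cpy reads that view as its first source.
    """
    matches = []
    for i in range(len(nodes) - 2):
        gdn, view, cpy = nodes[i], nodes[i + 1], nodes[i + 2]
        if gdn.op != "gated_delta_net":
            continue
        view_reads_gdn = (
            view.op == "view" and len(view.srcs) > 0 and view.srcs[0] is gdn
        )
        cpy_reads_view = (
            cpy.op == "cpy" and len(cpy.srcs) > 0 and cpy.srcs[0] is view
        )
        if view_reads_gdn and cpy_reads_view:
            matches.append(i)
    return matches


# Toy graph: the copy moves a view of the GDN output into a cache slot.
x = Node("input")
cache = Node("cache_slot")
gdn = Node("gated_delta_net", [x])
view = Node("view", [gdn])
cpy = Node("cpy", [view, cache])
graph = [x, cache, gdn, view, cpy]

# Counter-example: the copy reads the GDN output directly, with no view.
no_view_graph = [x, gdn, Node("cpy", [gdn, cache])]

print(find_gdn_view_cpy(graph))
print(find_gdn_view_cpy(no_view_graph))
```

Once such a triple is found, a backend could in principle have the producing kernel write to the copy's destination and skip the separate copy, which is the effect the release notes describe for the CUDA backend.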

The optimization applies to the CUDA backend, where it removes redundant copies and intermediate writes around gated_delta_net. The release otherwise keeps the project's broad set of pre-built binaries across operating systems and hardware accelerators.