The llama.cpp project has released version b9862, which adds a CUDA performance optimization for the gated_delta_net operation and provides pre-built binaries for macOS, Linux, Windows, Android, and openEuler.
- Removes redundant CUDA copies after gated_delta_net by detecting the gated_delta_net -> view -> cpy pattern.
- Allows the CUDA GDN kernel to write state snapshots directly into the recurrent cache, skipping intermediate tail writes.
- Disables KleidiAI support for macOS Apple Silicon in this release.
- Provides binaries for Ubuntu x64/arm64/s390x with CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends.
- Includes Windows builds for CPU, OpenCL Adreno, CUDA 12/13, Vulkan, OpenVINO, SYCL, and HIP.
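
The pattern detection mentioned in the first bullet can be pictured as a walk over the compute graph that looks for a copy whose source is a view of a gated_delta_net output. The following is a minimal sketch of that idea only, not the project's code: the `Node` class, its fields, the `find_redundant_copies` helper, and the node names are all hypothetical stand-ins rather than ggml data structures or APIs.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Toy graph node (hypothetical; not ggml's tensor struct)."""
    name: str
    op: str                        # e.g. "gated_delta_net", "view", "cpy"
    src: Optional["Node"] = None   # single producer, enough for this sketch


def find_redundant_copies(graph: List[Node]) -> List[Tuple[str, str, str]]:
    """Return (gdn, view, cpy) name triples for each
    gated_delta_net -> view -> cpy chain found in the node list."""
    matches = []
    for node in graph:
        if node.op != "cpy":
            continue
        view = node.src
        if view is None or view.op != "view":
            continue
        gdn = view.src
        if gdn is None or gdn.op != "gated_delta_net":
            continue
        matches.append((gdn.name, view.name, node.name))
    return matches


# Example graph: one chain that matches, one copy that does not.
gdn = Node("gdn_out", "gated_delta_net")
state_view = Node("state_view", "view", src=gdn)
state_cpy = Node("state_cpy", "cpy", src=state_view)
other = Node("attn_out", "mul_mat")
other_cpy = Node("attn_cpy", "cpy", src=other)

matches = find_redundant_copies([gdn, state_view, state_cpy, other, other_cpy])
print(matches)  # [('gdn_out', 'state_view', 'state_cpy')]
```

In a backend, a match like this would be the point where the separate copy could be dropped and the kernel asked to write to the copy's destination directly. How this release actually performs that step is not described in these notes.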
The CUDA changes reduce copy overhead for workloads that use the gated_delta_net operation, and the pre-built binaries keep coverage broad across operating systems and hardware backends.