The llama.cpp project has released version b9847, which fixes Gemma E4B MTP FlashAttention in the CUDA backend and removes an unused template declaration. Prebuilt binaries are provided for the platforms listed below.

  • Fixes Gemma E4B MTP FlashAttention in the CUDA backend (#25148)
  • Removes unused template declaration
  • macOS Apple Silicon (arm64) binaries available
  • macOS Intel (x64) binaries available
  • iOS XCFramework provided
  • Ubuntu x64 and arm64 CPU builds included
  • Ubuntu Vulkan, ROCm 7.2, OpenVINO, SYCL FP32, and SYCL FP16 builds available
  • Android arm64 CPU build released
  • Windows x64 and arm64 CPU builds provided
  • Windows CUDA 12.4 and 13.3 builds with DLLs included
  • Windows Vulkan, OpenVINO, SYCL, and HIP builds available
  • openEuler x86 and aarch64 builds for Ascend 310p and 910b (ACL Graph) NPUs
  • General UI binary released