The llama.cpp project has released version b9789, which includes a critical fix for quantizing Mixture of Experts (MoE) models that use multi-token prediction. The update addresses issues identified in pull request #24986 so that these model architectures are handled correctly during quantization. The release provides pre-built binaries for macOS on Apple Silicon and Intel, as well as an iOS XCFramework. Linux users can download Ubuntu builds for the CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends. Windows builds cover CPU, CUDA 12.4 and 13.3, Vulkan, OpenVINO, SYCL, and HIP. Builds are also provided for additional platforms, including Android arm64 and openEuler, in specific hardware configurations.
llama.cpp b9789 Release Fixes MoE Quantization and Provides Multi-Platform Binaries