The llama.cpp project released version b9876, which fixes a critical crash that occurs when tensor parallelism is combined with CPU-offloaded experts in Mixture of Experts (MoE) models.

  • Fixes an abort during warm-up on MoE models caused by a GGML_ASSERT failure in ggml-backend-meta.cpp.
  • Resolves the case where the MoE router output, a mirrored non-contiguous tensor, triggered the assertion.
  • Moves the split-state lookup above the contiguity assertion so that the mirrored case is allowed in both get_tensor and set_tensor.
  • Provides binaries for macOS (Apple Silicon and Intel), Linux, Android, Windows, and openEuler across CPU, Vulkan, ROCm, CUDA, OpenVINO, SYCL, and HIP backends.
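
The reordering described in the split-state bullet can be illustrated with a toy model. This is only a sketch, not the code in ggml-backend-meta.cpp: ToyTensor, SplitState, and toy_set_tensor are hypothetical names, and the assumption that a mirrored tensor is copied whole to every device (and so needs no contiguity check) is ours, not something the release notes state.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for illustration; none of these names come from ggml.
enum class SplitState { Mirrored, Split };

struct ToyTensor {
    bool contiguous;                             // whether the tensor's memory layout is contiguous
    SplitState state;                            // how the tensor is distributed across devices
    std::vector<std::vector<float>> per_device;  // one buffer per device
};

// Look up the split state first, then enforce contiguity only on the
// path that slices the data. Checking contiguity before the lookup would
// reject a mirrored, non-contiguous tensor even though (in this toy model)
// the mirrored path never slices anything.
void toy_set_tensor(ToyTensor & t, const std::vector<float> & data) {
    const SplitState state = t.state;  // lookup happens before any contiguity check

    if (state == SplitState::Mirrored) {
        // Assumption: mirrored means every device receives the full data.
        for (auto & buf : t.per_device) {
            buf = data;
        }
        return;
    }

    // Only the split path requires a contiguous layout in this sketch.
    if (!t.contiguous) {
        throw std::logic_error("split tensor must be contiguous");
    }

    const std::size_t n = t.per_device.size();
    if (n == 0) {
        return;
    }
    // Any remainder from an uneven division is ignored to keep the sketch short.
    const std::size_t chunk = data.size() / n;
    for (std::size_t i = 0; i < n; ++i) {
        const auto first = data.begin() + static_cast<std::ptrdiff_t>(i * chunk);
        const auto last  = first + static_cast<std::ptrdiff_t>(chunk);
        t.per_device[i].assign(first, last);
    }
}
```

In this model a mirrored, non-contiguous tensor is accepted, while a split, non-contiguous tensor is still rejected, which mirrors the behavior the release notes describe for get_tensor and set_tensor.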

With this fix, MoE models can run with tensor parallelism and CPU-offloaded experts without hitting the backend assertion.