The llama.cpp b9820 release reintroduces an optimization that reduces synchronization during split compute, primarily benefiting the CUDA backend. It also provides pre-built binaries for macOS, Linux, Windows, Android, and openEuler, covering CPU, GPU, and specialized hardware accelerators.

  • Improves CUDA performance by reducing synchronization between tokens.
  • Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async().
  • Relaxes sync requirements between input copies on supported backends like CUDA.
  • Replaces the synchronous copy with an asynchronous copy function and adds macro guards for non-CUDA builds.
  • Reworks backend detection in ggml-backend.cpp to avoid linking conflicts.
  • Fixes HIP backend pipeline-parallel bugs by adding single-GPU synchronizations in multi-GPU settings.
  • Excludes HIP and MUSA from the copy_from_host optimization (CPU split to GPU split) as a precautionary measure.
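
The idea behind the relaxed synchronization and the macro-guarded async copy can be illustrated with a toy model. This is only a sketch under stated assumptions, not code from ggml or llama.cpp: ToyStream, toy_upload_inputs, kAsyncCopyAvailable, and the SKETCH_HAS_ASYNC_COPY flag are all hypothetical names, and the "stream" is a plain in-process queue rather than a real GPU stream. It shows how queuing several input copies and waiting once costs fewer blocking waits than waiting after every copy.

```cpp
// Toy model only: none of these names belong to ggml or llama.cpp.
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical build flag standing in for a CUDA-enabled build.
// Builds without it default to the conservative per-copy wait.
#ifdef SKETCH_HAS_ASYNC_COPY
constexpr bool kAsyncCopyAvailable = true;
#else
constexpr bool kAsyncCopyAvailable = false;
#endif

// A stand-in for a device stream: work is queued and only runs on synchronize().
struct ToyStream {
    std::vector<std::function<void()>> pending; // queued, not yet executed
    int sync_count = 0;                          // number of blocking waits so far

    void enqueue(std::function<void()> op) { pending.push_back(std::move(op)); }

    void synchronize() {
        for (auto & op : pending) { op(); }
        pending.clear();
        ++sync_count;
    }
};

// Copies every input buffer into `device` and returns the number of waits used.
// async_copy == true : queue all copies, then wait once at the end.
// async_copy == false: wait after each copy (the synchronous behaviour).
inline int toy_upload_inputs(ToyStream & stream,
                             const std::vector<std::vector<float>> & inputs,
                             std::vector<std::vector<float>> & device,
                             bool async_copy = kAsyncCopyAvailable) {
    device.clear();
    device.resize(inputs.size());
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        device[i].resize(inputs[i].size());
        const float * src = inputs[i].data();
        float * dst = device[i].data();
        const std::size_t bytes = inputs[i].size() * sizeof(float);
        stream.enqueue([src, dst, bytes] {
            if (bytes > 0) { std::memcpy(dst, src, bytes); }
        });
        if (!async_copy) { stream.synchronize(); }
    }
    // One final wait covers anything still queued.
    if (!stream.pending.empty()) { stream.synchronize(); }
    return stream.sync_count;
}
```

In this model, uploading three input buffers costs three waits on the synchronous path and one wait on the deferred path, while the copied data is identical in both cases.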

The release enables faster inference on CUDA devices through optimized asynchronous operations while maintaining compatibility across a wide range of operating systems and hardware backends.