The llama.cpp b9820 release improves performance by reintroducing a change that reduces the number of synchronizations during split compute, specifically targeting the CUDA backend. The release also provides pre-built binaries for macOS, Linux, Windows, Android, and openEuler across CPU, GPU, and specialized hardware accelerators.
- Improves CUDA performance by performing fewer synchronizations between tokens.
- Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async().
- Relaxes sync requirements between input copies on supported backends like CUDA.
- Replaces the synchronous copy with an asynchronous copy function and adds macro guards for non-CUDA builds.
- Reworks backend detection in ggml-backend.cpp to avoid linking conflicts.
- Fixes pipeline-parallel bugs in the HIP backend by adding single-GPU synchronizations in multi-GPU settings.
- Excludes HIP and MUSA from the copy_from_host optimization for CPU-split-to-GPU-split copies as a precautionary measure.
The release enables faster inference on CUDA devices through optimized asynchronous operations while maintaining compatibility across a wide range of operating systems and hardware backends.
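
The idea behind the relaxed synchronization can be illustrated with a small, self-contained C++ sketch. It is not llama.cpp or ggml code: `MockBackend`, `copy_from_host_async`, `copy_input`, and `upload_inputs` are hypothetical names invented for this example, and the "stream" is simulated with a plain queue on the host. The sketch also uses a runtime flag where the release notes describe macro guards. It only shows the pattern: try an asynchronous copy first, fall back to a blocking copy when the backend does not support it, and synchronize once after all inputs are queued instead of once per input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical stand-in for a compute backend (NOT a real ggml type).
struct MockBackend {
    bool supports_async_copy;                  // assumption: true models CUDA, false models an excluded backend
    std::vector<std::function<void()>> queue;  // work enqueued on the simulated stream
    int sync_count;                            // number of times the host had to wait

    explicit MockBackend(bool async) : supports_async_copy(async), sync_count(0) {}

    // Run all queued work, as a stream synchronization would.
    void synchronize() {
        for (auto & op : queue) op();
        queue.clear();
        ++sync_count;
    }
};

// Try to enqueue a host-to-device copy. Returns false if the backend cannot
// do it asynchronously. The source buffer must stay alive until synchronize().
bool copy_from_host_async(MockBackend & be, void * dst, const void * src, std::size_t n) {
    if (!be.supports_async_copy) return false;
    be.queue.push_back([=] { std::memcpy(dst, src, n); });
    return true;
}

// Copy one input: asynchronous when possible, otherwise a blocking copy
// followed by an immediate synchronization.
void copy_input(MockBackend & be, void * dst, const void * src, std::size_t n) {
    if (copy_from_host_async(be, dst, src, n)) return;
    std::memcpy(dst, src, n);
    be.synchronize();
}

// Upload every input, then synchronize once if anything is still pending.
// Returns the total number of synchronizations the backend has performed.
int upload_inputs(MockBackend & be,
                  std::vector<std::vector<float>> & dst,
                  const std::vector<std::vector<float>> & src) {
    dst.resize(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i].resize(src[i].size());
        copy_input(be, dst[i].data(), src[i].data(), src[i].size() * sizeof(float));
    }
    if (!be.queue.empty()) be.synchronize();
    return be.sync_count;
}
```

In this model, calling `upload_inputs` with three inputs costs one synchronization on a backend constructed as `MockBackend(true)` and three on one constructed as `MockBackend(false)`, which is the kind of saving the bullets above attribute to the CUDA path.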