llama.cpp b9820 release: reduced CUDA syncs and new binaries

The llama.cpp b9820 release improves performance by reintroducing reduced synchronization during split compute, specifically on the CUDA backend. It also provides pre-built binaries for macOS, Linux, Windows, Android, and openEuler, covering CPU, GPU, and specialized hardware accelerators.

Improves CUDA performance via reduced synchronizations between tokens.
Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async().
Relaxes sync requirements between input copies on supported backends such as CUDA.
Replaces the synchronous copy with an asynchronous copy function and adds macro guards for non-CUDA builds.
Reworks backend detection in ggml-backend.cpp to avoid linking conflicts.
Fixes HIP backend pipeline-parallel bugs by adding single-GPU synchronizations in multi-GPU settings.
Excludes HIP and MUSA from the copy_from_host optimization (CPU split to GPU split) as a precautionary measure.
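
The effect of the copy-related changes above can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyKind, ToyBackend, toy_cpy_async, and toy_copy_inputs are hypothetical names invented for illustration, not part of ggml or llama.cpp, and a plain host-side copy stands in for a real device transfer. It shows the general pattern of trying an asynchronous copy first and falling back to a blocking copy (with one synchronization per input) when the destination backend does not take the async path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model; none of these names exist in ggml or llama.cpp.
enum class ToyKind { CPU, CUDA, HIP };

struct ToyBackend {
    ToyKind kind = ToyKind::CPU;
    int sync_count = 0;                  // blocking synchronizations issued so far
    void synchronize() { ++sync_count; } // stand-in for a device-wide wait
};

// Try to queue a host-to-device copy without blocking.
// Returns false when the destination does not take the async path,
// in which case the caller must fall back to a blocking copy.
bool toy_cpy_async(ToyBackend & dst, ToyKind src_kind,
                   const float * src, float * out, std::size_t n) {
    assert(n == 0 || (src != nullptr && out != nullptr));
    // Assumption of this model: only CPU -> CUDA copies are queued asynchronously.
    if (dst.kind != ToyKind::CUDA || src_kind != ToyKind::CPU) {
        return false;
    }
    std::copy(src, src + n, out); // stand-in for an enqueued transfer
    return true;
}

// Copy every host input to the destination backend.
// Returns the number of synchronizations issued by this call.
int toy_copy_inputs(ToyBackend & dst,
                    const std::vector<std::vector<float>> & inputs,
                    std::vector<std::vector<float>> & outputs) {
    const int before = dst.sync_count;
    outputs.resize(inputs.size());
    bool any_async = false;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        const std::size_t n = inputs[i].size();
        outputs[i].resize(n);
        if (toy_cpy_async(dst, ToyKind::CPU, inputs[i].data(), outputs[i].data(), n)) {
            any_async = true; // no per-copy synchronization needed
            continue;
        }
        // Fallback: blocking copy, one synchronization per input.
        dst.synchronize();
        std::copy(inputs[i].begin(), inputs[i].end(), outputs[i].begin());
    }
    if (any_async) {
        dst.synchronize(); // a single wait before compute consumes the inputs
    }
    return dst.sync_count - before;
}
```

In this model, copying three inputs to a CUDA-kind backend costs a single synchronization, while a HIP-kind backend, which is kept on the blocking path here to mirror the precautionary exclusion listed above, pays one synchronization per input. The real scheduler logic in ggml-backend.cpp is more involved, so the counts are illustrative only.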

The release enables faster inference on CUDA devices through optimized asynchronous operations while maintaining compatibility across a wide range of operating systems and hardware backends.