llama.cpp PR #20793: Reintroducing less synchronizations during split compute

Pull request #20793 reintroduces reduced synchronization during split compute in llama.cpp, primarily to improve CUDA performance. It replaces synchronous copies with async copies and relaxes sync requirements between input copies on backends that support it.

Improves CUDA performance via fewer synchronizations between tokens.
Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async().
Introduces a function to relax sync requirements between input copies, currently supported for CUDA.
Replaces the synchronous copy with the async copy function and adds macro guards for non-CUDA builds.
Reworks backend detection in ggml-backend.cpp to avoid linking conflicts and simplifies synchronizations to adhere to the saaasg pattern.

These modifications allow backends like Vulkan to adopt relaxed explicit syncs for HtoD copies and graph execution, while maintaining stricter checks for CPU-to-CUDA async copies.
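
The reasoning behind swapping synchronous copies for async ones can be illustrated with a toy model. The sketch below is not ggml or CUDA code: ToyStream, run_split, and RunStats are hypothetical names invented for illustration, and the "stream" is just an in-order host queue. It assumes only that work queued on one stream executes in submission order, which is the property that makes a per-copy sync redundant when the compute step is queued on the same stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <queue>
#include <vector>

// Toy stand-in for an in-order device stream (hypothetical; not a ggml or CUDA type).
struct ToyStream {
    std::queue<std::function<void()>> pending; // work runs in enqueue order
    int sync_count = 0;                        // how many times the host blocked

    // Enqueue a copy and return immediately, like an async host-to-device copy.
    void copy_async(float * dst, const float * src, std::size_t n) {
        pending.push([=] { std::memcpy(dst, src, n * sizeof(float)); });
    }

    // Enqueue compute that reads the copied inputs (here: sum all elements).
    void compute_async(const float * data, std::size_t n, float * out) {
        pending.push([=] {
            float sum = 0.0f;
            for (std::size_t i = 0; i < n; ++i) sum += data[i];
            *out = sum;
        });
    }

    // Block the host until everything queued so far has run.
    void synchronize() {
        ++sync_count;
        while (!pending.empty()) {
            pending.front()();
            pending.pop();
        }
    }
};

struct RunStats {
    float result; // value produced by the compute step
    int   syncs;  // number of host-side synchronizations
};

// Copy every input into one "device" buffer, then run compute on it.
// With sync_after_each_copy the host blocks after every copy (strict);
// without it, stream ordering alone guarantees compute sees the copies.
RunStats run_split(const std::vector<std::vector<float>> & inputs, bool sync_after_each_copy) {
    std::size_t total = 0;
    for (const auto & in : inputs) total += in.size();

    std::vector<float> device(total);
    ToyStream stream;
    float result = 0.0f;

    std::size_t offset = 0;
    for (const auto & in : inputs) {
        stream.copy_async(device.data() + offset, in.data(), in.size());
        if (sync_after_each_copy) stream.synchronize();
        offset += in.size();
    }
    assert(offset == total);

    stream.compute_async(device.data(), device.size(), &result);
    stream.synchronize(); // single sync before the host reads the result
    return { result, stream.sync_count };
}
```

Calling run_split with the same inputs in both modes yields the same result, but the relaxed mode blocks the host once instead of once per input copy plus once at the end. In a real backend the saving comes from the host not stalling while the device is still busy, which this host-only model does not measure.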