llama.cpp b9827 release adds CUDA 2D async copy optimization

The llama.cpp b9827 release adds a cudaMemcpy2DAsync fast path to the CUDA backend's ggml_cuda_cpy function. The fast path handles same-type, same-shape strided copies in which the tensors are not fully contiguous but each row is, and for those cases it replaces the slower element-wise scalar copy kernels.
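To make the "each row is contiguous" condition concrete, here is a minimal host-only sketch in C++. It is not the llama.cpp code: the View2D struct, rows_are_contiguous, can_use_2d_copy, memcpy_2d_host and demo_copy are hypothetical names invented for illustration, and the real check in ggml_cuda_cpy may differ. The only real API referenced is cudaMemcpy2DAsync, whose argument order (dst, dpitch, src, spitch, width, height, kind, stream) the host stand-in imitates, without the kind and stream arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical, simplified 2D tensor view; all strides are in bytes.
struct View2D {
    std::size_t elem_size;   // bytes per element
    std::size_t cols;        // elements per row
    std::size_t rows;        // number of rows
    std::size_t stride_elem; // bytes between neighbouring elements in a row
    std::size_t stride_row;  // bytes between the starts of consecutive rows (the pitch)
};

// A row is contiguous when its elements are packed back to back.
// The rows themselves may still be separated by padding (pitch > row bytes).
bool rows_are_contiguous(const View2D & v) {
    return v.stride_elem == v.elem_size && v.stride_row >= v.cols * v.elem_size;
}

// Assumed eligibility test: same element size, same shape, packed rows on both sides.
bool can_use_2d_copy(const View2D & src, const View2D & dst) {
    return src.elem_size == dst.elem_size &&
           src.cols == dst.cols && src.rows == dst.rows &&
           rows_are_contiguous(src) && rows_are_contiguous(dst);
}

// Host stand-in for a pitched 2D copy: one block copy of `width` bytes per row.
// On the GPU this loop would be a single cudaMemcpy2DAsync call.
void memcpy_2d_host(void * dst, std::size_t dpitch,
                    const void * src, std::size_t spitch,
                    std::size_t width, std::size_t height) {
    assert(width <= dpitch && width <= spitch);
    auto * d = static_cast<unsigned char *>(dst);
    const auto * s = static_cast<const unsigned char *>(src);
    for (std::size_t r = 0; r < height; ++r) {
        std::memcpy(d + r * dpitch, s + r * spitch, width);
    }
}

// Copies a 3x4 float view out of a 3x6 buffer into a packed 3x4 buffer.
std::vector<float> demo_copy() {
    const std::size_t rows = 3, cols = 4, src_cols = 6;
    std::vector<float> src(rows * src_cols);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<float>(i);
    }
    std::vector<float> dst(rows * cols, -1.0f);

    const View2D sv{sizeof(float), cols, rows, sizeof(float), src_cols * sizeof(float)};
    const View2D dv{sizeof(float), cols, rows, sizeof(float), cols * sizeof(float)};

    if (can_use_2d_copy(sv, dv)) {
        memcpy_2d_host(dst.data(), dv.stride_row, src.data(), sv.stride_row,
                       cols * sizeof(float), rows);
    }
    return dst;
}
```

Calling demo_copy() returns the first four values of each six-element source row, so the padding columns are skipped with one block copy per row rather than one operation per element. A view whose elements within a row are themselves strided (for example, every second element) fails rows_are_contiguous and would still need a general element-wise copy.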

Implements a fast path for 2D pitched block copies in CUDA to improve performance on non-contiguous tensors.
Fixes GDN recurrent snapshot updates when using -np 4 by addressing rollback slot separation issues.
Adds new tests to validate the optimized strided copy path.
Marks strided copies as unsupported in the OpenVINO backend, because the new tests fail there.
Disables macOS Apple Silicon (arm64, KleidiAI enabled) builds for this release.

This update reduces overhead on CUDA devices for strided tensor copies whose rows are contiguous, and fixes GDN recurrent snapshot updates when running with -np 4.