llama.cpp b9828 release: OpenCL Flash Attention improvements and new binaries

The llama.cpp b9828 release reworks the OpenCL Flash Attention (FA) kernels for f16 and f32 precision. It also adds prefill prepass kernels and FA kernels for the q4_0 and q8_0 quantization formats.

Reworked FA kernel for f16 and f32 with optimized tile padding and masking logic.
Added FA kernels for q4_0 and q8_0 quantization, including dequant kernels and SOA tensor support.
Introduced an FA tile tuning table with override capabilities and fixed infinity handling when kernels are built with the -cl-finite-math-only compiler option.
Provided pre-built binaries for macOS (Apple Silicon/Intel), Linux (CPU/Vulkan/ROCm/OpenVINO/SYCL), Windows (CPU/CUDA/Vulkan/HIP/OpenVINO/SYCL), Android, and openEuler.
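
To illustrate why tile padding, masking, and infinity handling interact, here is a minimal Python sketch of a tiled (streaming) softmax that masks with a large finite value instead of negative infinity. It is not the OpenCL kernel code from this release. The function name, the tile size, and the choice of MASK_VALUE are assumptions made for illustration only.

```python
import math

# Assumption: a large finite negative number stands in for -inf, a common
# approach when the compiler is allowed to assume infinities never occur.
MASK_VALUE = -65504.0  # lowest finite f16 value; a hypothetical choice


def online_softmax_weighted_sum(scores, values, mask, tile=4):
    """Compute sum(softmax(scores) * values) one tile at a time.

    Positions where mask[i] is False, and padding slots past the end of
    the input, are given MASK_VALUE as their score and 0.0 as their value.
    Assumes real scores are well above MASK_VALUE.
    """
    running_max = MASK_VALUE
    denom = 0.0
    acc = 0.0
    n = len(scores)
    padded = ((n + tile - 1) // tile) * tile  # round up to a whole tile
    for start in range(0, padded, tile):
        tile_scores = []
        tile_values = []
        for i in range(start, start + tile):
            if i < n and mask[i]:
                tile_scores.append(scores[i])
                tile_values.append(values[i])
            else:
                tile_scores.append(MASK_VALUE)
                tile_values.append(0.0)
        new_max = max(running_max, max(tile_scores))
        # Rescale what was accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc *= scale
        for s, v in zip(tile_scores, tile_values):
            p = math.exp(s - new_max)
            denom += p
            acc += p * v
        running_max = new_max
    return acc / denom
```

With float("-inf") as the mask value, a fully masked tile would evaluate exp(-inf - (-inf)), which is NaN. A finite sentinel keeps every intermediate value finite: masked slots underflow to a weight of zero as soon as a real score is seen, and a fully masked input returns 0.0 in this sketch.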

This release enables more efficient inference on OpenCL-compatible hardware by optimizing memory access patterns and supporting additional quantization types.
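
For readers unfamiliar with the two quantization formats, the following Python sketch shows how one block of each is dequantized, using ggml's block layouts (32 elements per block with one f16 scale). It also shows one way to split interleaved q8_0 blocks into separate scale and quant arrays, on the assumption that "SOA" above means a structure-of-arrays layout. The function names are hypothetical and the code does not reproduce the release's OpenCL kernels.

```python
import struct

QK = 32  # elements per block in ggml's q4_0 and q8_0 formats


def dequant_q8_0(scale, quants):
    """One q8_0 block: 32 signed 8-bit values sharing one scale."""
    assert len(quants) == QK
    return [scale * q for q in quants]


def dequant_q4_0(scale, packed):
    """One q4_0 block: 16 bytes, each holding two 4-bit values offset by 8.

    Low nibbles map to elements 0..15, high nibbles to elements 16..31.
    """
    assert len(packed) == QK // 2
    out = [0.0] * QK
    for j, byte in enumerate(packed):
        out[j] = scale * ((byte & 0x0F) - 8)
        out[j + QK // 2] = scale * ((byte >> 4) - 8)
    return out


def split_q8_0_to_soa(raw, n_blocks):
    """Split interleaved q8_0 blocks (f16 scale followed by 32 int8 values)
    into one array of scales and one flat array of quants."""
    scales, quants = [], []
    block_size = 2 + QK  # 2 bytes of f16 scale plus 32 one-byte quants
    for b in range(n_blocks):
        off = b * block_size
        (d,) = struct.unpack_from("<e", raw, off)
        scales.append(d)
        quants.extend(struct.unpack_from("<32b", raw, off + 2))
    return scales, quants
```

Keeping scales and quants in separate contiguous arrays lets neighbouring work-items read neighbouring addresses, which is the usual motivation for a structure-of-arrays layout on GPUs. Whether the release's kernels use exactly this split is not stated in the notes above.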