Inference efficiency — korshunov.ai

Topic · Inference efficiency

Claude Code v2.1.181 Release Notes

Claude Code v2.1.181 adds support for changing settings via prompt syntax such as /config thinking=false, adds sandbox Apple Events support on macOS, and improves streaming, auto-retry, and subagent behavior. It also fixes numerous bugs in startup, file handling, clipboard, and UI responsiveness across platforms.

github llama.cpp · 5d ago

ggml-webgpu Adds F16 Adapter Toggles for Vulkan and NVIDIA

The ggml-webgpu project has added adapter toggles for half-precision (F16) support on Vulkan and NVIDIA GPUs. This update enables improved performance on compatible hardware across multiple platforms, including macOS, Linux, Android, Windows, and openEuler, with specific builds available for ARM and x64 architectures.

github llama.cpp · 5d ago

llama.cpp Release b9731: Performance Optimization and Cross-Platform Binaries

llama.cpp version b9731 uses std::partial_sort to cut token-sorting overhead, reducing top-n token selection time from 8.555 ms to 0.704 ms. The release includes prebuilt binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options.
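
The idea behind the speedup can be sketched as follows. This is a minimal illustration, not the code from the release: the TokenProb struct and the top_n_tokens function are hypothetical names chosen for the example. std::partial_sort orders only the first n elements, so for a vocabulary of size V the work is roughly O(V log n) rather than the O(V log V) of a full sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical candidate record; the real llama.cpp types differ.
struct TokenProb {
    int   id;    // token id
    float prob;  // probability (or logit) of that token
};

// Return the n most probable candidates in descending order of prob.
// Only the first n positions are put in order; the rest are left unsorted
// and then discarded.
std::vector<TokenProb> top_n_tokens(std::vector<TokenProb> cands, std::size_t n) {
    assert(n <= cands.size());  // precondition of this sketch
    const auto mid = cands.begin() + static_cast<std::ptrdiff_t>(n);
    std::partial_sort(cands.begin(), mid, cands.end(),
                      [](const TokenProb & a, const TokenProb & b) {
                          return a.prob > b.prob;
                      });
    cands.resize(n);
    return cands;
}
```

With a vocabulary of tens of thousands of tokens and a small n, most candidates are never fully ordered, which is where a partial sort saves time over a full one.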

github llama.cpp · 5d ago

llama.cpp Release b9729: New Binaries and Platform Support

llama.cpp version b9729 ships binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes CPU, Vulkan, OpenVINO, SYCL, and ROCm builds, along with a new UI package. Internal references to 'webui' have been removed.

github llama.cpp · 5d ago

llama.cpp Release b9728 Adds Comment Line Support and Multiple Platform Binaries

llama.cpp version b9728 adds support for comment lines in the file passed to --api-key-file. The release includes prebuilt binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.
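
A key file with comment lines could be read along these lines. This is a sketch under assumptions, not the server's parser: the summary does not say which comment marker is accepted, so the use of '#' here is an assumption, and read_api_keys and trim are hypothetical helper names.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Strip ASCII whitespace from both ends of a line.
static std::string trim(const std::string & s) {
    const char * ws = " \t\r\n";
    const auto b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Read one API key per line, skipping blank lines and comment lines.
// ASSUMPTION: '#' as the comment marker is illustrative only.
std::vector<std::string> read_api_keys(std::istream & in) {
    std::vector<std::string> keys;
    std::string line;
    while (std::getline(in, line)) {
        const std::string key = trim(line);
        if (key.empty() || key[0] == '#') continue;
        keys.push_back(key);
    }
    return keys;
}
```

Taking a std::istream rather than a path keeps the sketch easy to exercise with a std::istringstream as well as a std::ifstream.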

github llama.cpp · 5d ago

llama.cpp release b9718: consolidated slot selection and new binary builds

llama.cpp version b9718 consolidates slot selection into a single function, get_available_slot, while maintaining LCP similarity checks for prompt cache updates. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options.
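
One way a slot picker with an LCP (longest common prefix) similarity check might look is sketched below. Only the name get_available_slot comes from the release summary; the Slot struct, pick_slot, the min_sim threshold, and the least-recently-used fallback are all hypothetical choices made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical slot record; not the server's actual slot type.
struct Slot {
    int              id;
    bool             busy;
    std::int64_t     last_used;  // smaller means used longer ago
    std::vector<int> cached;     // tokens of the prompt held in this slot's cache
};

// Number of leading tokens two sequences share.
std::size_t common_prefix_len(const std::vector<int> & a, const std::vector<int> & b) {
    const std::size_t n = std::min(a.size(), b.size());
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) ++i;
    return i;
}

// Prefer the idle slot whose cached prompt shares the largest fraction of the
// new prompt (at least min_sim); otherwise fall back to the least recently
// used idle slot. Returns nullptr when every slot is busy.
const Slot * pick_slot(const std::vector<Slot> & slots,
                       const std::vector<int> & prompt, float min_sim) {
    assert(min_sim >= 0.0f && min_sim <= 1.0f);
    const Slot * best = nullptr;
    const Slot * lru  = nullptr;
    float best_sim = 0.0f;
    for (const Slot & s : slots) {
        if (s.busy) continue;
        if (lru == nullptr || s.last_used < lru->last_used) lru = &s;
        if (prompt.empty()) continue;
        const float sim = static_cast<float>(common_prefix_len(s.cached, prompt)) /
                          static_cast<float>(prompt.size());
        if (sim >= min_sim && sim > best_sim) {
            best = &s;
            best_sim = sim;
        }
    }
    return best != nullptr ? best : lru;
}
```

Reusing the slot with the longest shared prefix lets the server keep that part of the prompt cache instead of recomputing it.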

github llama.cpp · 5d ago

llama.cpp Release b9721 Available for Multiple Platforms

llama.cpp has released version b9721, offering binaries for macOS, Linux, Android, Windows, and openEuler across various architectures. The release includes CPU, Vulkan, ROCm, OpenVINO, SYCL, and HIP support, with a dedicated UI package. A feature for Apple Silicon with KleidiAI is currently disabled.

github llama.cpp · 6d ago

llama.cpp Release b9715 Adds CUDA Col2Im 1D and Multiple Platform Binaries

llama.cpp version b9715 adds CUDA support for GGML_OP_COL2IM_1D, building on the existing CPU implementation. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and acceleration frameworks, including Vulkan, ROCm, OpenVINO, and SYCL.

arxiv arXiv cs.AI · 6d ago

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant enables 4-bit KV caching for context-heavy agents, reducing P50 time-to-first-token by 3.47x in late rounds and boosting output throughput by 1.63x over FP8 KV baseline. It achieves this using FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA on AMD CDNA4 GPUs, with optimizations for decode-attention kernels and robust design choices like asymmetric K/V treatment and Walsh-Hadamard rotation.
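
The Walsh-Hadamard rotation mentioned above is commonly used before low-bit quantization because it spreads outlier values across a whole group. A minimal CPU sketch of the transform itself follows; it is not the paper's kernel, and the function name fwht and the choice of an orthonormal 1/sqrt(n) scaling are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place fast Walsh-Hadamard transform with orthonormal scaling.
// The length must be a power of two. With the 1/sqrt(n) factor the
// transform is its own inverse, so applying it twice restores the input.
void fwht(std::vector<float> & x) {
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two length
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;  // butterfly: sum
                x[j + h] = a - b;  // butterfly: difference
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float & v : x) v *= scale;
}
```

For example, the vector {8, 0, 0, 0} becomes {4, 4, 4, 4}: the single spike is spread evenly, so a shared group scale no longer has to stretch to cover one large value.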

arxiv arXiv cs.LG · 6d ago

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant introduces a 4-bit KV caching method tailored for context-heavy agent workloads. It achieves 3.47x reduction in P50 time-to-first-token in late rounds and 1.63x higher output throughput compared to FP8 KV caching, using FP8 queries, FP4 KV tensors, and native AMD CDNA4 scaled-MFMA support.

arxiv arXiv cs.LG · 6d ago

Execution-State Capsules for Low-Latency On-Device AI Serving

Execution-state capsules enable graph-bound checkpointing and restoration of complete execution state, including KV, recurrent, and convolution states, for low-latency, small-batch on-device AI serving. On RTX 5090 and Jetson AGX Thor, capsule restore achieves byte-exact and token-identical correctness, with sub-millisecond GPU operations and TTFT speedups up to 27x at 16k tokens, demonstrating significant latency reduction in interactive AI workflows.

arxiv arXiv cs.LG · 6d ago

StreamKL: Fast and Memory-Efficient KL Divergence for Attention Distillation

StreamKL introduces a fused GPU primitive that eliminates quadratic memory usage in attention distillation by streaming query-key tiles through on-chip SRAM. It achieves up to 43x speedup in forward and 14x in backward passes, reducing extra HBM footprint from O(N_Q N_K) to O(1), enabling long-context distillation on a single GPU.
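
The streaming idea can be illustrated on a single attention row with a scalar CPU sketch. This is not the paper's fused kernel: the function name streaming_kl and the single-row formulation are assumptions. It relies on the identity KL(p || q) = sum_i p_i (t_i - s_i) - lse(t) + lse(s), where p = softmax(t), q = softmax(s), and lse is log-sum-exp, so one pass with a few running scalars suffices and no probability row is ever stored.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// KL(softmax(t) || softmax(s)) for one row of teacher scores t and student
// scores s, computed in a single pass with O(1) extra state.
double streaming_kl(const std::vector<double> & t, const std::vector<double> & s) {
    assert(t.size() == s.size() && !t.empty());
    const double ninf = -std::numeric_limits<double>::infinity();
    double mt = ninf, ms = ninf;  // running maxima of t and s
    double zt = 0.0, zs = 0.0;    // running sums of exp(x - max)
    double w  = 0.0;              // running sum of exp(t_i - mt) * (t_i - s_i)
    for (std::size_t i = 0; i < t.size(); ++i) {
        if (t[i] > mt) {          // new teacher max: rescale accumulators
            const double r = std::exp(mt - t[i]);
            zt *= r;
            w  *= r;
            mt  = t[i];
        }
        if (s[i] > ms) {          // new student max: rescale its sum
            zs *= std::exp(ms - s[i]);
            ms  = s[i];
        }
        const double e = std::exp(t[i] - mt);
        zt += e;
        w  += e * (t[i] - s[i]);
        zs += std::exp(s[i] - ms);
    }
    // E_p[t - s] - lse(t) + lse(s)
    return w / zt - (mt + std::log(zt)) + (ms + std::log(zs));
}
```

The rescaling step is the same online-softmax trick used by tiled attention kernels; processing keys tile by tile instead of one at a time changes the loop granularity but not the running state.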

arxiv arXiv cs.CL · 6d ago

Selective Verification for Budget-Aware Reasoning

Sevra, a serving-layer controller, selectively verifies answers to improve accuracy and reduce token usage. On MATH-500, it achieves 76.3% accuracy with 26.8% fewer post-generation tokens and half as many harmful flips, while on GSM8K it verifies only 3.0% of examples, raising accuracy to 94.5% and cutting verification tokens by 91.2%. The study shows that initial solve length and explicit control needs determine the optimal verification strategy.

github llama.cpp · 6d ago

llama.cpp Release b9703: Updates and Binary Downloads

llama.cpp version b9703 includes a rework of the server's preset handling, removing remote HF preset support and deprecated functions. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

github llama.cpp · 6d ago

llama.cpp release b9704: fixes invalid grammar handling and adds new binaries

llama.cpp version b9704 now returns HTTP 400 for invalid grammar instead of silently dropping constraints. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware accelerators, with support for Vulkan, ROCm, OpenVINO, SYCL, and CUDA.

github llama.cpp · 6d ago

llama.cpp Release b9698 Adds Self-Update Support and Multiple Platform Binaries

llama.cpp version b9698 enables self-updates only when built with llama-install.sh. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

arxiv arXiv cs.LG · 7d ago

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

EfficientRollout introduces a self-speculative decoding framework that reduces rollout and end-to-end latency by up to 19.6% and 12.7% respectively, without compromising final model quality. It uses a quantized drafter derived from the target model and integrates a system-aware toggle policy to avoid compute-bound regimes, enabling effective speculation during evolving policy generations.
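
The draft-then-verify step at the core of speculative decoding can be sketched for the greedy case as below. This is a generic illustration, not EfficientRollout's implementation: it omits the quantized drafter and the system-aware toggle policy, and the function name verify_draft and its input layout are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy verification of a drafted continuation.
// draft[i]  : token proposed by the drafter at position i
// target[i] : token the target model picks at position i when given the
//             prefix plus draft[0..i-1]; it has one extra entry for the
//             position after the last draft token.
// Returns the tokens to emit: the longest prefix of the draft that the
// target agrees with, followed by one token from the target itself.
std::vector<int> verify_draft(const std::vector<int> & draft,
                              const std::vector<int> & target) {
    assert(target.size() == draft.size() + 1);
    std::vector<int> out;
    std::size_t k = 0;
    while (k < draft.size() && draft[k] == target[k]) {
        out.push_back(draft[k]);  // accepted draft token
        ++k;
    }
    out.push_back(target[k]);     // correction, or bonus token if all matched
    return out;
}
```

Every call emits at least one token chosen by the target model, so under greedy decoding the output matches what the target alone would have produced; the gain comes from accepting several draft tokens per target pass.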

github llama.cpp · 7d ago

ggml-cpu: Conditionally enable POWER11 backend based on compiler support

The ggml-cpu project now conditionally enables the POWER11 backend in ggml based on compiler support for -mcpu=power11. This prevents build failures on current GCC/Clang toolchains while maintaining forward compatibility. Updates to CMakeLists.txt support this change, and -mcpu=power10 is used for both P10 and P11 architectures.

github llama.cpp · 7d ago

llama.cpp Release b9692 Adds New Binaries and Fixes

llama.cpp version b9692 provides new binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release covers Vulkan, ROCm, OpenVINO, SYCL, and HIP builds and includes a fix that removes batch dim usage in llava_uhd.

github llama.cpp · 7d ago

Metal backend adds f16 and bf16 support for concat operator

The Metal backend in llama.cpp has been extended to support f16 and bf16 tensor types for the concat operator, in addition to existing f32 and i32 support. This update includes specialized kernel templates, updated pipeline getters, and improved type-based kernel dispatch, with assistance from pi:llama.cpp/Qwen3.6-27B.
