Inference efficiency — korshunov.ai

Topic · Inference efficiency

Metal backend adds f16 and bf16 support for concat operator

The Metal backend in llama.cpp now supports f16 and bf16 tensor types for the concat operator, in addition to the existing f32 and i32 support. The update includes specialized kernel templates, updated pipeline getters, and improved type-based kernel dispatch, with assistance from pi:llama.cpp/Qwen3.6-27B.

github llama.cpp · 7d ago

llama.cpp releases version b9688 with new APIs and cross-platform binaries

llama.cpp version b9688 adds APIs for model management and for real-time updates over SSE. The release includes prebuilt binaries for macOS, Linux, Android, Windows, and openEuler, supporting various architectures and acceleration frameworks such as Vulkan, CUDA, OpenVINO, and SYCL.

github llama.cpp · 8d ago

llama.cpp Release b9685 Adds SYCL Dev2Dev Memcpy and Multiple Platform Binaries

llama.cpp version b9685 introduces SYCL-based dev2dev memcpy functionality, moving GGML_SYCL_DEV2DEV_MEMCPY to a runtime table and improving peer-to-peer communication detection. The release includes precompiled binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and APIs, including Vulkan, ROCm, OpenVINO, and SYCL (FP32/FP16).

github llama.cpp · 8d ago

llama.cpp Release b9684 Adds Conv_3D and Multiple Platform Binaries

llama.cpp release b9684 introduces a new 3D convolution operation (conv_3d) with optimized implementations. The release provides prebuilt binaries for macOS, Linux, Android, Windows, and openEuler across various architectures and hardware acceleration options, including SYCL, Vulkan, CUDA, and OpenVINO.

github llama.cpp · 8d ago

llama.cpp release b9682 adds Vulkan support and new platform binaries

llama.cpp version b9682 introduces Vulkan support for Linux and Windows, enabling GPU acceleration. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures, with CPU and GPU options including CUDA, OpenVINO, SYCL, and ROCm.

github llama.cpp · 8d ago

llama.cpp release b9675 adds FP16 support and new platform binaries

llama.cpp version b9675 enables FP16 support for operations like SQR, SQRT, LOG, SIN, COS, and CLAMP. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures, with support for Vulkan, ROCm, OpenVINO, SYCL (FP16 and FP32), and CUDA 12.4 and 13.3.

github llama.cpp · 8d ago

llama.cpp release b9680: new binaries and Vulkan support

llama.cpp releases version b9680 with updated Vulkan support and new binaries for macOS, Linux, Android, Windows, and openEuler. The release includes CPU and GPU variants for multiple architectures, with support for Vulkan, CUDA, OpenVINO, SYCL, and ROCm.

github llama.cpp · 9d ago

llama.cpp releases b9669 with backend sampling for Eagle3

llama.cpp version b9669 adds backend sampling support for Eagle3. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, ROCm, OpenVINO, and SYCL.

github llama.cpp · 9d ago

llama.cpp Release b9670: Fixes and New Builds

llama.cpp release b9670 includes fixes for NVFP4 edge cases in llama-graph, such as moving post-GEMM MUL operations and restricting build_ffn to supported combinations. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and backend options, including CUDA, Vulkan, SYCL, and OpenVINO.

github llama.cpp · 9d ago

llama.cpp Release b9667 Adds Vulkan and CUDA Support

llama.cpp release b9667 introduces Vulkan support with S_v=16 via gated_delta_net. It includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures, with options for Vulkan, CUDA 12.4 and 13.3, ROCm, OpenVINO, and SYCL.

github llama.cpp · 9d ago

llama.cpp release b9665 adds --offline flag and new binary builds

llama.cpp version b9665 introduces a new --offline flag for benchmarking. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, ROCm, OpenVINO, and SYCL.

github llama.cpp · 9d ago

llama.cpp Release b9663 Adds SYCL Support and New Binary Builds

llama.cpp release b9663 adds support for OP EXPM1 and all unit test cases for FLOOR, TRUNC, and ROUND. It includes updated binaries for macOS, Linux, Android, Windows, and openEuler, with support for SYCL (FP32 and FP16), Vulkan, CUDA 12.4 and 13.3, and ROCm 7.2, along with an updated UI.

github llama.cpp · 9d ago

sycl: support reordered Q4_K/Q5_K/Q6_K MoE MUL_MAT_ID

The SYCL update extends reordered expert tensor handling in MoE MUL_MAT_ID to Q4_K, Q5_K, and Q6_K. Unsupported 3D reorder cases now fall back instead of aborting.

github llama.cpp · 9d ago

Vulkan adds col2im_1d op and supports multiple platforms

llama.cpp release b9661 adds GGML_OP_COL2IM_1D support for Vulkan, using a bounded gather loop instead of a full-K scan with modulo. The implementation returns nullptr for unsupported types, and the release includes builds for macOS, Linux, Android, Windows, and openEuler across CPU, Vulkan, CUDA, and SYCL.

arxiv arXiv cs.CL · 9d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprints.

arxiv arXiv cs.CL · 9d ago

KVEraser: Efficient Localized Context Erasing in LLMs

KVEraser enables efficient localized context erasing in large language models by replacing only the KV cache states of an erased span with learned steering states. It achieves near-full-recomputation performance on in-domain tasks across 1K to 32K context lengths, with only a 24% latency increase, and outperforms other approximate methods in long-document QA with a 3-4x speedup over full recomputation.

media r/LocalLLaMA · 7d ago

GLM-5.2-FP8 HGX-H200 SGLang Docker Deployment Config

A user shares a Docker configuration for running GLM-5.2-FP8 on HGX-H200 hardware using SGLang. The setup achieves 262k context length and 70 tokens per second with 8 tensor parallelism, using a memory fraction of 0.83. The user notes that vLLM official recipes do not work on H200 due to KV cache FP8 quantization limitations on the DSV3 architecture.

media r/LocalLLaMA · 8d ago

Gemma 4 E2B runs at 255 tok/s in browser using WebGPU

Gemma 4 E2B achieves 255 tokens per second in-browser on an M4 Max using WebGPU kernels. The demo and kernels are now available on Hugging Face for public use.

media r/LocalLLaMA · 8d ago

TRELLIS.2 now runs natively on MLX

TRELLIS.2 has been ported to run natively on MLX for Apple Silicon. The model supports 512x512 and 1024x1024 image inputs, with generation times of approximately 70 seconds for 512x512 and 300 to 700 seconds for 1024x1024 on an M4 Max with 128GB unified memory.

media Latent Space · 8d ago

GLM-5.2 Claims Top Position in Frontend Coding with Speculative Decoding

GLM-5.2, a 744B parameter model from Z.ai, has been evaluated as the top frontend coding model globally, outperforming all Opus versions including Opus 4.8. This achievement is highlighted in third-party evaluations that validate official offline tests, marking a significant milestone for a model of its size, particularly in the competitive frontend coding domain.
