Inference efficiency — korshunov.ai

Fast-TurboQuant: Multiplier-Free Vector Quantization

Fast-TurboQuant introduces a multiplier-free projection method built on a structured fast Johnson-Lindenstrauss transform. It replaces dense random rotation matrices with Rademacher phase inversion (random sign flips) followed by a fast Walsh-Hadamard transform, reducing the projection's arithmetic to additions and subtractions, and reports higher Recall@10 together with lower mean squared error.
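
The rotation step described above can be sketched in a few lines. This is a minimal illustration only, assuming a power-of-two vector length and a single round of sign flips; the function names (fwht, rademacher_signs, structured_rotation) are hypothetical, and the paper's normalization, quantizer, and number of rounds are not given in the summary.

```python
import random


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform using only adds and subtracts."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly pass: combine elements that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def rademacher_signs(n, seed=0):
    """Draw n random signs from {-1, +1} (assumed seeding scheme)."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(n)]


def structured_rotation(vector, signs):
    """Flip signs by negation, then apply the transform; no multiplications."""
    if len(vector) != len(signs):
        raise ValueError("vector and signs must have the same length")
    flipped = [-x if s < 0 else x for x, s in zip(vector, signs)]
    return fwht(flipped)
```

Because the unnormalized transform scales the squared norm by exactly n, distances between vectors are preserved up to that known factor, which is what lets it stand in for a dense rotation in this sketch.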

media Hugging Face Forums · 23h ago

Best model for local usage and working on Unity with MCP at 12 GB VRAM

A user is seeking a lightweight LLM tailored for Unity 6.5 with MCP, operating within 12 GB VRAM. They currently rely on free tiers of Cursor and Claude but find them insufficient, asking if any specialized models exist or alternative solutions are available.

arxiv arXiv cs.CL · 1d ago

Posterior Refinement: Fast Language Generation via Any-Order Flow Maps

FMLM+ introduces Posterior Refinement, a strategy that enables adaptive self-correction during inference. By combining flow map transport with masking-style noise schedules, it achieves high-fidelity language generation with 32x fewer noise-free evaluations, outperforming both MDM and FMLM on the speed-quality trade-off.

github llama.cpp · 1d ago

llama.cpp release b9776 adds Vulkan and multiple hardware support

llama.cpp version b9776 introduces Vulkan support for Linux and Windows, along with CPU, OpenCL, CUDA, and SYCL variants across macOS, Linux, Android, and Windows. The release also includes support for OpenVINO and ROCm, with UI available in a standalone package.

arxiv arXiv cs.AI · 1d ago

Recency/Frequency Adaptive KV Caching for LLM Serving

A new KV caching method dynamically allocates cache space between recently and frequently used blocks to improve efficiency. It boosts KV cache hit rate by up to 10.8% and reduces time to first token by up to 12.6% on synthetic workloads, with 2.1% and 2.0% gains on real-world conversation tasks.
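
To make the idea of splitting cache space between recent and frequent blocks concrete, here is a toy sketch in the spirit of adaptive replacement caching. It is not the paper's algorithm: the class name AdaptiveBlockCache, the ghost lists, and the one-step adjustment rule are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class AdaptiveBlockCache:
    """Toy block cache that splits capacity between recent and frequent blocks."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.target_recent = capacity // 2   # desired size of the recency segment
        self.recent = OrderedDict()          # blocks seen once, oldest first
        self.frequent = OrderedDict()        # blocks seen again, oldest first
        self.ghost_recent = OrderedDict()    # keys evicted from `recent`
        self.ghost_frequent = OrderedDict()  # keys evicted from `frequent`

    def access(self, key):
        """Record one lookup of `key`; return True on a cache hit."""
        if key in self.recent:
            # Second touch: promote the block to the frequency segment.
            del self.recent[key]
            self.frequent[key] = True
            return True
        if key in self.frequent:
            self.frequent.move_to_end(key)
            return True
        destination = self.recent
        if key in self.ghost_recent:
            # A recently evicted one-time block came back: favor recency.
            del self.ghost_recent[key]
            self.target_recent = min(self.capacity, self.target_recent + 1)
            destination = self.frequent
        elif key in self.ghost_frequent:
            # An evicted frequent block came back: favor frequency.
            del self.ghost_frequent[key]
            self.target_recent = max(0, self.target_recent - 1)
            destination = self.frequent
        if len(self.recent) + len(self.frequent) >= self.capacity:
            self._evict()
        destination[key] = True
        return False

    def _evict(self):
        """Evict the oldest block from whichever segment exceeds its share."""
        over_target = len(self.recent) > self.target_recent
        if self.recent and (over_target or not self.frequent):
            source, ghost = self.recent, self.ghost_recent
        else:
            source, ghost = self.frequent, self.ghost_frequent
        victim, _ = source.popitem(last=False)
        ghost[victim] = True
        if len(ghost) > self.capacity:
            ghost.popitem(last=False)
```

The value target_recent is the adaptive part: it rises when evicted one-time blocks are requested again and falls when evicted frequent blocks return, so the split follows the workload rather than a fixed ratio.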

arxiv arXiv cs.AI · 1d ago

ACE-GS: Efficient and Accurate 3D Gaussian Splatting

ACE-GS introduces a progressive optimization framework that achieves accurate, compact, and efficient 3D Gaussian Splatting. It enables up to 3.7 times faster training than Speedy-Splat, with a 0.89 dB PSNR improvement over original 3DGS, while maintaining high structural similarity and a compact scene representation.

arxiv arXiv cs.AI · 1d ago

Empirical Study of OpenPangu Quantization on Ascend NPUs

A controlled study evaluates OpenPangu 1B and 7B models on Huawei Ascend 910B1 NPUs using weight-only and weight-activation quantization methods. Results show 8-bit weight-only quantization is lossless for both models, while 4-bit quantization is practical for 7B but harmful for 1B on reasoning, math, and code tasks. Ultra-low precision methods like 2-bit and binary fail, and W4A4 SmoothQuant produces non-finite perplexity, indicating extreme low-bit compression remains challenging.
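
As a rough intuition for why 8-bit weight-only quantization can be near lossless while 4-bit is not, the sketch below applies plain symmetric round-to-nearest quantization to a list of weights. This is an assumed baseline scheme with hypothetical function names, not one of the methods evaluated in the study, and it uses a single per-tensor scale.

```python
def quantize_dequantize(weights, bits):
    """Symmetric round-to-nearest quantization with one scale per tensor."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1          # 127 for 8-bit, 7 for 4-bit
    largest = max(abs(w) for w in weights)
    if largest == 0:
        return list(weights)
    scale = largest / qmax              # real-valued width of one integer step
    dequantized = []
    for w in weights:
        level = max(-qmax, min(qmax, round(w / scale)))
        dequantized.append(level * scale)
    return dequantized


def mean_squared_error(reference, approx):
    """Average squared difference between two equal-length lists."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)
```

The step size grows from largest/127 at 8 bits to largest/7 at 4 bits, so the worst-case rounding error per weight is roughly 18 times larger, which is consistent with smaller models having less redundancy to absorb it.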

media r/LocalLLaMA · 1d ago

Mimo 2.5 is fast at large context on dual RTX Pro 6000

Mimo 2.5 maintains fast performance at large context lengths on dual RTX Pro 6000 cards using a 5-to-1 local/global sliding-window attention mechanism, similar to Gemma 3. It completes tasks in about 4 minutes, significantly faster than MiniMax M3, which takes around 40 minutes, despite both models having similar quality under VRAM limits.
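
A short sketch shows why a 5-to-1 local/global layout keeps long contexts cheap: only the global layers cache every position. The layer count, window size, and the placement of the global layer at the end of each group of six are assumptions for illustration, not published Mimo 2.5 details.

```python
def layer_kinds(num_layers, local_per_global=5):
    """Label each layer, assuming one global layer closes every group."""
    period = local_per_global + 1
    return [
        "global" if (index + 1) % period == 0 else "local"
        for index in range(num_layers)
    ]


def cached_positions(num_layers, context_len, window, local_per_global=5):
    """Total key/value positions kept across layers for one sequence."""
    total = 0
    for kind in layer_kinds(num_layers, local_per_global):
        if kind == "global":
            total += context_len               # full attention keeps everything
        else:
            total += min(context_len, window)  # sliding window keeps the tail
    return total
```

With 12 layers, a 1,000-token context, and a 100-token window, this layout keeps 3,000 positions instead of the 12,000 that full attention in every layer would need.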

arxiv arXiv cs.AI · 1d ago

SwarmX: Agentic Scheduling for Low-Latency Systems

SwarmX introduces neural predictors to enable prompt-aware scheduling in agentic AI systems. It reduces tail latency by up to 61.5% and maintains up to 2x the throughput of production schedulers under the same service level objectives.

arxiv arXiv cs.AI · 1d ago

MoE Models Show Device-Dependent Inference Performance

An empirical study finds that Mixture-of-Experts models do not consistently outperform dense models on consumer or edge hardware. On the Apple M2 Pro, OLMoE-1B-7B is only 10% slower than a comparable dense model, while on the NVIDIA Jetson Orin Nano, it is 31% slower with 2.1 times higher energy per token, due to memory and KV-cache constraints. The results indicate that sparse activation benefits are limited by total-parameter memory footprint, especially on bandwidth-bound edge devices.

media r/LocalLLaMA · 1d ago

New Qwen-27B IQ4_KS and IQ4_KS_KT Quantizations for ik_llama.cpp

Two new GGUF quantizations for Qwen-27B have been released for ik_llama.cpp, optimized for 16 GB VRAM on NVIDIA GPUs. The first, Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf, improves logical reasoning at the cost of general knowledge, with a perplexity of 7.4131. The second, Qwen3.6-27B.i1-IQ4_KS_KT-attn_qkv-IQ4_KS.gguf, applies Trellis quantization (iq4_kt) selectively to tensors with near-Gaussian distributions, achieving a perplexity of 7.4091, showing minimal performance degradation.

media r/LocalLLaMA · 2d ago

OpenRouter model prices imply heavier quantization

OpenRouter's model pricing suggests that providers quantize models heavily, because raw inference costs would exceed API prices unless throughput is high or serving is well optimized. The author argues that unless providers achieve much better efficiency or offer premium, high-fidelity access, quantization likely degrades output quality, especially in complex tasks like planning and coding, which raises concerns about transparency and access to true model capability.

media r/LocalLLaMA · 2d ago

GLM 5.2 on Mac Studio Speedup PR

GLM 5.2 delivers improved prefill speeds exceeding 100 t/s at higher context lengths. The update reduces memory usage, enabling 4-bit quantized models to handle over 100k context tokens efficiently. This enhancement is detailed in a PR by the oMLX creator.

media r/LocalLLaMA · 2d ago

KLD Analysis of KV Cache Quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

A detailed analysis maps the KLD (Kullback-Leibler divergence) of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B models. Results show q8/q8 quantization is nearly lossless on both models, while q4/q4 performs well on Qwen but causes severe degradation on Gemma. Turbo quantization variants show mixed performance, with turbo3 and turbo2 enabling extreme cache compression at significant accuracy cost.
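
For readers unfamiliar with the metric, the sketch below computes the KL divergence between the next-token distribution of a baseline run and that of a run with a quantized KV cache, starting from raw logits. The function names and the choice to average over positions are assumptions; the post's exact measurement setup is not described in the summary.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def kl_divergence(base_logits, test_logits):
    """KL(base || test) in nats for a single token position."""
    base_logp = log_softmax(base_logits)
    test_logp = log_softmax(test_logits)
    return sum(math.exp(p) * (p - q) for p, q in zip(base_logp, test_logp))


def mean_kld(base_rows, test_rows):
    """Average KL divergence over a sequence of token positions."""
    pairs = list(zip(base_rows, test_rows))
    return sum(kl_divergence(b, t) for b, t in pairs) / len(pairs)
```

A value near zero means the quantized cache reproduces the baseline distribution almost exactly, which is the sense in which a setting can be called nearly lossless.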

github llama.cpp · 2d ago

Vulkan backend updates and new binary releases for llama.cpp

llama.cpp release b9774 adds Vulkan backend support for SQR, SQRT, SIN, COS, CLAMP, LEAKY_RELU, and NORM operations, with support for noncontiguous inputs. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and backends including CUDA, OpenVINO, SYCL, and ROCm.

github llama.cpp · 2d ago

llama.cpp Release b9775: New Binaries and Support for Multiple Platforms

llama.cpp has released version b9775, introducing binaries for macOS, Linux, Android, Windows, and openEuler across various architectures. The release includes CPU, Vulkan, OpenVINO, SYCL, and ROCm support, with updated CUDA versions (12.4 and 13.3) and iOS XCFramework availability. A UI package is also provided.

media r/LocalLLaMA · 2d ago

Multi-Tier MoE Caching: Optimizing Expert Activation in Large Models

In MoE models such as GLM 5.2 and DeepSeek V4, the top 20% of experts reportedly handle 85% of activations. A multi-tier caching approach could keep these hot experts in GPU memory, using high-bandwidth VRAM for faster inference while the remaining experts stay in slower tiers. Existing systems such as PowerInfer, Lidenburg's llama.cpp, and HOBBIT demonstrate practical implementations of expert caching and prefetching.
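
The placement decision behind such a cache can be sketched from routing statistics alone. The functions below are hypothetical and assume per-expert activation counts keyed by integer expert ID; they do not reflect how PowerInfer, HOBBIT, or any llama.cpp fork actually implements caching.

```python
def rank_experts(activation_counts):
    """Order experts from most to least activated, breaking ties by ID."""
    return sorted(activation_counts.items(), key=lambda item: (-item[1], item[0]))


def hot_experts(activation_counts, coverage=0.85):
    """Smallest prefix of the ranking that covers the requested share."""
    total = sum(activation_counts.values())
    chosen, covered = [], 0
    for expert, count in rank_experts(activation_counts):
        if total == 0 or covered >= coverage * total:
            break
        chosen.append(expert)
        covered += count
    return chosen


def assign_tiers(activation_counts, gpu_slots):
    """Place the most active experts on the GPU and the rest on the CPU."""
    ranked = rank_experts(activation_counts)
    return {
        expert: ("gpu" if rank < gpu_slots else "cpu")
        for rank, (expert, _) in enumerate(ranked)
    }
```

For counts of 50, 35, 10, and 5 activations across four experts, the first two experts already cover 85% of traffic, so two GPU slots would serve most tokens without touching the slower tier.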

github llama.cpp · 2d ago

llama.cpp Release b9771 Adds Vulkan Support and Optimizations

llama.cpp release b9771 introduces Vulkan support across Linux and Windows and reduces shader variants and binary size by making mul_mm ALIGNED a specialization constant. The release includes binaries for macOS, Linux, Android, Windows, and openEuler, with variants for CPU, Vulkan, OpenVINO, SYCL, and ROCm.

github llama.cpp · 2d ago

Fix for Vulkan result checking and test linking in llama.cpp

llama.cpp now links ggml-cpu when GGML_VULKAN_CHECK_RESULTS or GGML_VULKAN_RUN_TESTS are enabled to resolve linking failures. This fix restores debug functionality for Vulkan result verification and testing after the ggml-cpu library was split.

arxiv arXiv cs.CL · 2d ago

SVD-Surgeon: Optimal Singular-Value Surgery for LLM Compression

SVD-Surgeon is a training-free method that applies the Optimal Brain Surgeon framework to singular-value decomposition. It computes a closed-form update for retained singular values to compensate for truncation, improving the perplexity-compression trade-off on OPT and LLaMA 2-7B models without retraining.