Inference efficiency — korshunov.ai

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

HyperQuant is a unified post-training quantization pipeline designed for the weights and KV cache of large language and diffusion transformers, combining Hadamard transforms with optimal lattice quantization. The method outperforms recent schemes like HIGGS, TurboQuant, and OCTOPUS across various bit rates while maintaining near-lossless quality.
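
As a rough illustration of the two ingredients named in the summary, the sketch below rotates a weight vector with a normalized fast Walsh-Hadamard transform and then rounds it on a uniform grid. It is not HyperQuant's pipeline: the uniform scalar grid is a stand-in for a lattice quantizer, and the function names, the 4-bit setting, and the sample weights are all assumptions made for the example.

```python
import math

def fwht(vec):
    """Normalized fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def quantize_uniform(vec, bits):
    """Symmetric uniform rounding with a per-vector scale (stand-in for a lattice quantizer)."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(x) for x in vec) or 1.0
    step = peak / levels
    return [round(x / step) * step for x in vec]

def rotate_quantize(vec, bits=4):
    # The normalized Hadamard matrix is its own inverse, so the same
    # transform that rotates the vector also undoes the rotation.
    return fwht(quantize_uniform(fwht(vec), bits))

weights = [0.02, -0.01, 0.03, 4.0, -0.02, 0.01, 0.00, -0.03]  # one outlier channel (made up)
spread = fwht(weights)            # the outlier's energy is spread over all coordinates
rotated = rotate_quantize(weights, 4)
```

The rotation flattens the outlier so that no single coordinate dominates the quantizer's range; whether that lowers the final error depends on the quantizer and the weight distribution, which this toy does not attempt to show.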

arxiv arXiv cs.AI · 9h ago

GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation

Researchers propose GRINQH, a weight-only post-training quantization framework that accelerates large language model decoding by unifying quantization and sparsification. The method leverages activation magnitudes to dynamically assign weight channels to different precision levels, addressing the memory-bound nature of the decoding stage.
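
The idea of grading weight channels by activation magnitude can be sketched as follows. This is an assumption-laden toy, not the paper's algorithm: the tier fractions, the bit widths, and the rule that leftover channels get zero bits (that is, are skipped) are all hypothetical choices for illustration.

```python
def assign_precision(activations, tiers=((0.25, 8), (0.50, 4))):
    """Map each input channel to a bit width by activation magnitude.

    tiers: (fraction_of_channels, bits), ordered from most to least important.
    Channels left over after all tiers get 0 bits, i.e. they are skipped.
    """
    n = len(activations)
    order = sorted(range(n), key=lambda i: abs(activations[i]), reverse=True)
    bits = [0] * n
    cursor = 0
    for fraction, width in tiers:
        count = int(round(fraction * n))
        for idx in order[cursor:cursor + count]:
            bits[idx] = width
        cursor += count
    return bits

acts = [0.1, -3.0, 0.5, 0.02, 1.2, -0.3, 0.0, 2.1]  # one token's activations (made up)
plan = assign_precision(acts)
```

Because the plan depends on the current activations, it changes from token to token, which is what makes the assignment dynamic; channels that receive zero bits need not be read from memory at all, which is where the saving for memory-bound decoding would come from.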

media r/LocalLLaMA · 9h ago

LFM2.5 230M Runs In-Browser at 1,400 tok/s via Custom WebGPU Kernels

The LiquidAI LFM2.5-230M model now runs locally in the browser on custom WebGPU kernels. The kernels were developed by Fable 5, prior to its shutdown, and Opus 4.8. The demonstration was recorded on an M4 Max device and reaches a generation speed of 1,400 tokens per second. All processing occurs within the user's browser, with no external server dependencies. A GGUF version of the model is available for download on Hugging Face alongside the standard checkpoint. Users can try the live demo hosted by webml-community on Hugging Face Spaces.

arxiv arXiv cs.AI · 10h ago

Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

The authors present a framework for modeling the energy consumption of Transformer training across multiple GPUs, addressing the need for sustainable system design as computational costs rise. By conducting controlled architectural sweeps on BERT models, they relate measured energy usage to lightweight proxies for compute, memory traffic, and hardware efficiency. The approach is inspired by roofline models and incorporates a speedup-based hardware-efficiency factor to account for tensor parallelism and fully sharded data parallelism. This methodology allows for the derivation of a scaling law model that accurately predicts training energy across heterogeneous configurations. The work highlights the critical importance of predicting energy consumption as model size and parallelism scale. It provides a practical tool for cost-aware design in large-scale natural language processing systems.

arxiv arXiv cs.AI · 11h ago

Kamera: Training-Free Position-Invariant Multimodal KV Cache for Efficient Reuse

The authors introduce Kamera, a method that enables training-free reuse of multimodal key-value caches by addressing the loss of cross-chunk conditioning in naive prefix caching. Standard state-merge recovers direct readouts but fails to preserve the diffuse, low-rank residue in deep layers that multi-hop reasoning depends on, and accuracy halves as a result. To repair this, Kamera stores a small, training-free low-rank conditioning patch alongside each position-free chunk. This approach allows exact RoPE re-rotation and cross-chunk binding restoration across MLA, GQA, and MHA attention mechanisms. The system supports cheap reorder, sliding-window survival, and recall operations without requiring re-encoding of evicted chunks. Experiments show that a rank-m patch recovers full task accuracy on cross-chunk-binding benchmarks like MM-NIAH and two-page doc-QA. The solution reconstructs re-prefill KV to within bf16 rounding in a production SGLang kernel across six backbones while using only a fraction of the original KV footprint.
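
The re-rotation step rests on a general property of RoPE: rotations compose additively, so a cached key can be moved to a new position by rotating it through the position offset. The sketch below shows only that property. It assumes the adjacent-pair (interleaved) RoPE layout and the common base of 10000, uses hypothetical function names, and does not model Kamera's low-rank conditioning patch.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply RoPE to one head vector, pairing adjacent dimensions (2i, 2i+1)."""
    dim = len(vec)
    out = [0.0] * dim
    for i in range(dim // 2):
        angle = position * base ** (-2.0 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out

def relocate(cached_key, old_position, new_position):
    """Move an already-rotated key to a new position by rotating through the offset."""
    return rope_rotate(cached_key, new_position - old_position)

key = [0.3, -1.2, 0.7, 0.05]        # pre-RoPE key (made-up values)
cached = rope_rotate(key, 5)        # key as stored when its chunk sat at position 5
moved = relocate(cached, 5, 42)     # chunk reused at position 42
direct = rope_rotate(key, 42)       # the same pre-RoPE key rotated straight to 42
```

Here `moved` and `direct` agree up to floating-point rounding. What re-rotation cannot recover is the change in the pre-RoPE key itself when the surrounding context differs, which is the gap the summary attributes to the conditioning patch.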

media r/LocalLLaMA · 12h ago

llama.cpp b9788 adds SYCL tensor split support for Intel GPUs

The llama.cpp project has released version b9788, which introduces support for the --split-mode tensor option within its SYCL backend. This update specifically targets users running inference on Intel graphics processing units. The feature is implemented through pull request #24152 in the ggml-org repository. It enables the splitting of model tensors across multiple devices rather than relying solely on layer-based distribution. The release notes explicitly invite users with dual Intel GPU setups to test this new functionality. Contributors are encouraged to provide performance benchmarks to validate the improvements. This addition aims to enhance multi-GPU utilization for compatible Intel hardware configurations.

media r/LocalLLaMA · 12h ago

GLM 5.2 runs at 12t/s on dual RTX 5090 hardware

A user tested the unsloth quantized version of GLM 5.2 on a high-end consumer workstation featuring dual RTX 5090 GPUs and a Zen5 Threadripper Pro processor. The system utilized 512GB of DDR5 ECC RAM and was configured with specific llama.cpp compilation flags to enable CUDA optimizations and unified memory handling. The model weights were loaded from the UD-Q5_K_S quantization, which totaled approximately 492GB across multiple GGUF files. Performance testing involved running the llama-server with a context size of 32768 tokens and specific threading parameters for NUMA isolation. The benchmark results consistently showed an inference speed of 12 tokens per second during chat interactions without agentic workflows. Additional experiments revealed that omitting certain optimization flags, such as flash attention or NUMA settings, produced negligible changes in throughput.

media r/LocalLLaMA · 17h ago

Backtrack Sampler and Verifier Drastically Improve Tiny Model Coding Performance

A new backtrack sampler combined with a verifier model significantly enhances the coding performance of tiny 0.5B parameter models, potentially making them competitive with larger 2-4B class models without weight changes. In theory, the approach could also address hallucination in large models by correcting errors during generation through re-sampling. However, this method incurs a 5-30% decode speed penalty due to the need for backward passes and requires training a verifier model of similar size to the original. This requirement doubles VRAM usage and increases compute demands by 1.5 to 3 times compared to standard inference. Despite these costs, the verifier generalizes across models of equal or lower weight classes if trained on diverse data distributions. Training the verifier is highly efficient, requiring only approximately 0.01% of the number of tokens used for full pre-training.
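
The post does not spell out the sampler, so the following is only a minimal sketch of the general pattern, in its simplest one-token form: propose a token, ask a verifier to score the extended prefix, and re-sample at the same position if the score is too low. The `propose` and `verify` callables, the threshold, and the retry limit are hypothetical stand-ins for the generator and verifier models.

```python
def backtrack_generate(propose, verify, max_tokens, threshold=0.5, max_retries=4):
    """Generate tokens, re-sampling any token the verifier scores below threshold.

    propose(prefix, banned) -> next token not in `banned`, or None if none is left
    verify(prefix)          -> score in [0, 1] for the prefix ending in the new token
    """
    tokens = []
    while len(tokens) < max_tokens:
        banned = set()
        accepted = None
        for _ in range(max_retries):
            candidate = propose(tokens, banned)
            if candidate is None:
                break
            if verify(tokens + [candidate]) >= threshold:
                accepted = candidate
                break
            banned.add(candidate)  # reject, then re-sample at the same position
        if accepted is None:
            break  # stop rather than emit a token the verifier rejected
        tokens.append(accepted)
    return tokens

PREFERENCE = ["bad", "ok", "good"]  # toy proposer: always tries tokens in this order

def toy_propose(prefix, banned):
    for token in PREFERENCE:
        if token not in banned:
            return token
    return None

def toy_verify(prefix):
    return 0.0 if prefix[-1] == "bad" else 1.0

result = backtrack_generate(toy_propose, toy_verify, max_tokens=3)
```

Every rejected candidate costs an extra proposer call and an extra verifier call, which is one plausible source of the decode-speed penalty and added compute described in the post.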

media r/LocalLLaMA · 17h ago

NVIDIA Releases Nemotron-TwoTower-30B-A3B, a Diffusion-Based Language Model

NVIDIA has released the Nemotron-TwoTower-30B-A3B-Base-BF16 model, which is built upon the Nemotron 3 Nano 30B-A3B backbone. This architecture diverges from standard autoregressive models by utilizing a frozen context tower alongside a diffusion denoiser tower. The system iteratively fills blocks of tokens in parallel rather than generating them strictly one at a time. According to NVIDIA, the default mask-diffusion setup retains 98.7% of the autoregressive baseline's aggregate benchmark quality while achieving 2.42 times the baseline's wall-clock generation throughput. The release highlights a novel approach to language modeling that combines diffusion techniques with large-scale language capabilities.

media r/LocalLLaMA · 17h ago

GLM 5.2 on Dual Strix Halo (256GB): Worth it?

A Reddit user named Intrepid_Rub_3566 has shared a video review evaluating the performance of GLM 5.2 running on a dual AMD Strix Halo setup with 256GB of RAM. The discussion centers on whether this specific hardware configuration provides sufficient value for local large language model inference. The content highlights the technical feasibility of deploying GLM 5.2 in such an environment, focusing on resource utilization and speed. Viewers are directed to a YouTube link for detailed benchmarks and performance metrics. The thread also includes community comments discussing the practicality and cost-effectiveness of the dual Strix Halo approach.

media r/LocalLLaMA · 17h ago

User Reports Inferior Quality and Efficiency with MTP Models in Qwen 3.6 and Gemma 4

A user testing self-hosted Qwen 3.6 27B and Gemma 4 models on four RTX 5070 Ti cards reports that Multi-Token Prediction (MTP) degrades output quality compared to non-MTP variants. In code review tasks, the non-MTP model produced more detailed findings with fix suggestions while consuming fewer tokens than its MTP counterpart. The non-MTP setup reached approximately 2,000 tokens per second for prompt processing and 50-60 tokens per second for generation. The MTP configuration generated faster, at 100-120 tokens per second, but processed prompts more slowly, at around 1,300 tokens per second. Despite the higher generation throughput, real-world agent task completion times were only about 20% faster with MTP due to increased context consumption. The user ran llama.cpp with specific GGUF files from Unsloth and noted similar negative experiences when testing Gemma 4.
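
A back-of-the-envelope model shows how a roughly doubled generation rate can shrink to a much smaller end-to-end gain. The rates below are midpoints of the ranges in the report; the token counts and the 10% extra consumption under MTP are hypothetical values chosen for illustration, not measurements from the post.

```python
def task_seconds(prompt_tokens, generated_tokens, pp_rate, tg_rate):
    """Wall-clock estimate: prompt processing plus generation, rates in tokens/s."""
    return prompt_tokens / pp_rate + generated_tokens / tg_rate

# Hypothetical agent task: 60k prompt tokens processed, 5k tokens generated.
baseline = task_seconds(60_000, 5_000, pp_rate=2000, tg_rate=55)

# Assumption: the MTP run consumes 10% more context and output for the same task.
with_mtp = task_seconds(66_000, 5_500, pp_rate=1300, tg_rate=110)

speedup = baseline / with_mtp
print(f"baseline {baseline:.1f}s, MTP {with_mtp:.1f}s, speedup {speedup:.2f}x")
```

In prompt-heavy agent workloads the slower prompt processing and the extra tokens eat into the generation-side gain, so the overall speedup sits well below the ratio of the generation rates.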

media r/LocalLLaMA · 17h ago

Developer Requests Testing for MTP Support in GLM-4.7-Flash via llama.cpp

A developer is seeking community assistance to test Multi-Token Prediction (MTP) support for the GLM-4.7-Flash model within the llama.cpp framework. The author acknowledges that previous models like GLM Air and GLM Flash are outdated but expresses a personal interest in enabling MTP for them. The request specifically targets users who have the hardware to run GLM-4.7-Flash and can compile llama.cpp from source. Participants are asked to evaluate the provided GGUF model, report any issues they encounter, and measure and share the speedup achieved with MTP. The developer has uploaded the test model to a Hugging Face repository for immediate access. Users requiring smaller quantization options are invited to contact the author directly for alternative versions.

github llama.cpp · 18h ago

llama.cpp b9788 adds SYCL tensor parallelism for dual-GPU setups

The llama.cpp release b9788 introduces support for tensor parallelism via the --split-mode tensor flag in the SYCL backend. This implementation enables dual-GPU communication by adding comm_init, comm_free, and comm_allreduce_tensor functions to the meta-backend. For two devices, it uses a ring all-reduce strategy that switches between FP32 direct memcpy for small tensors and BF16 compression for larger ones. The code avoids OneCCL due to its single-device-per-process limitation, instead using persistent buffers to maintain SYCL pool invariants. Performance tests on dual Intel Arc Pro B70 GPUs show significant speedups over layer mode for Llama-3.3-70B and Qwen3-Coder-Next-80B-A3B models. The update includes new binaries for macOS, Linux, Windows, Android, and openEuler across CPU, CUDA, ROCm, Vulkan, and SYCL targets.
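
The size-dependent switch described above can be illustrated numerically. The sketch below is not the llama.cpp implementation and performs no communication: it only shows the effect of exchanging small tensors at full precision and larger ones rounded to bfloat16. The function names and the element-count threshold are made up for the example, and inputs are assumed to be finite.

```python
import struct

def to_bf16(x):
    """Round a finite float to bfloat16 precision (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the dropped range, plus the tie-breaking bit, then drop the low 16 bits.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def allreduce_two(a, b, small_limit=4):
    """Element-wise sum of two devices' tensors under a size-dependent transfer format.

    Tensors with at most `small_limit` elements are summed without compression;
    larger ones are rounded to bfloat16 first, standing in for a compressed
    transfer that halves the bytes on the link. The threshold is hypothetical.
    """
    if len(a) != len(b):
        raise ValueError("tensors must have the same length")
    if len(a) <= small_limit:
        return [x + y for x, y in zip(a, b)]
    return [to_bf16(x) + to_bf16(y) for x, y in zip(a, b)]

small = allreduce_two([0.1, 0.2], [0.3, 0.4])          # uncompressed path
large = allreduce_two([1.00390625] * 8, [1.0] * 8)     # compressed path loses the low bits
```

The trade-off is latency against bandwidth: small tensors are dominated by per-transfer overhead, so compressing them buys little, while large tensors are dominated by bytes moved, where halving the payload matters and a small rounding error is accepted in exchange.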

media r/LocalLLaMA · 20h ago

Reddit Inquiry on Running Large Models with 4x-8x RTX 6000 PROs

A Reddit user is seeking community feedback regarding the performance of large language models on systems equipped with four to eight NVIDIA RTX 6000 PRO GPUs. The inquiry specifically targets users who have between 384GB and 768GB of VRAM available for running models such as GLM 5.2, Kimi 2.7, and DeepSeek V4 Pro. The poster notes that while these models can technically run at 4-bit quantization, they may not fit within the memory constraints when using 8-bit precision. They reference a benchmark repository but highlight that it lacks data for the most recent model releases. A key concern raised is whether the performance degradation from using 4-bit versus 8-bit quantization is significant enough to impact agentic or programming tasks. The user also asks which inference backends, such as vLLM or SGLang, are currently being utilized by others in this hardware configuration.
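
The fit question in the post comes down to simple arithmetic: weight storage is roughly parameters times bits per weight divided by eight. The sketch below uses a hypothetical 700-billion-parameter model and an assumed 15% VRAM reserve for KV cache and activations; neither figure comes from the post, and quantization metadata is ignored.

```python
def weight_gigabytes(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1e9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8

def fits(params_billion, bits_per_weight, vram_gb, reserve_fraction=0.15):
    """True if the weights fit after reserving a share of VRAM for KV cache and activations."""
    usable = vram_gb * (1 - reserve_fraction)
    return weight_gigabytes(params_billion, bits_per_weight) <= usable

PARAMS_B = 700  # hypothetical parameter count in billions, not a figure from the post

for vram in (384, 768):
    for bits in (4, 8):
        print(f"{vram} GB VRAM, {bits}-bit: fits={fits(PARAMS_B, bits, vram)}")
```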

arxiv arXiv cs.CL · 1d ago

BITEMBED: Extreme Low-Bit Framework for LLM-Based Text Embeddings

The paper introduces BITEMBED, an extreme low-bit framework designed to address the high deployment costs of LLM-based text embedders by targeting both encoding efficiency and vector storage. The method converts pretrained LLM backbones into BitNet-style encoders featuring ternary weights, quantized activations, and lightweight normalization refinement. To adapt these models for representation learning, BITEMBED employs continual contrastive pre-training followed by supervised contrastive fine-tuning. This fine-tuning process utilizes similarity-distribution distillation and attention-relation distillation from a full-precision teacher model. Beyond backbone quantization, the framework trains output embeddings to support multiple storage precisions, allowing for flexible trade-offs between performance and storage costs. Experiments on the MMTEB benchmark using Qwen3-0.6B and Gemma3-270M show BITEMBED performing close to the full-precision teacher embedders.
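
For readers unfamiliar with ternary weights, the sketch below applies the absmean rule associated with BitNet b1.58: divide by the mean absolute weight, round, and clip to {-1, 0, 1}. That BITEMBED uses exactly this rule is an assumption, since the summary only says "BitNet-style", and the function names and sample weights are invented for the example.

```python
def ternarize(weights, eps=1e-8):
    """Quantize weights to {-1, 0, 1} with a single absmean scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate weights from ternary codes and the shared scale."""
    return [c * scale for c in codes]

w = [0.9, -0.05, 0.4, -1.3, 0.02, 0.6]   # made-up weights
codes, scale = ternarize(w)
approx = dequantize(codes, scale)
```

Each weight then needs under two bits of storage plus one shared scale, and matrix multiplication reduces to additions and subtractions, which is where the encoding-efficiency gain would come from.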

github llama.cpp · 1d ago

llama.cpp b9785 Release with Hardened Caps Check and Multi-Platform Binaries

The llama.cpp project has released version b9785, featuring a code change to harden caps checks as detailed in pull request #24973. This update provides pre-built binaries for macOS Apple Silicon, Intel Macs, and iOS via an XCFramework, with KleidiAI support disabled on Apple Silicon. Linux distributions including Ubuntu are supported for CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends across x64, arm64, and s390x architectures. Android users can access arm64 CPU binaries, while Windows offers extensive options covering CPU, OpenCL Adreno, CUDA 12 and 13, Vulkan, OpenVINO, SYCL, and HIP. The release also includes builds for openEuler targeting x86 and aarch64 processors with ACL Graph support. A standalone UI package is available alongside the platform-specific releases to facilitate local model inference.

media r/LocalLLaMA · 1d ago

Gemma4-26B-A4B & 31B-QAT Uncensored Balanced Released with MTP Speed Boosts

HauhauCS has released two new uncensored, balanced versions of the Gemma 4 models: Gemma4-26B-A4B and Gemma4-31B-QAT. Both variants incorporate Multi-Token Prediction (MTP) draft heads to enable speculative decoding, resulting in significant inference speed improvements. The 26B-A4B model achieves approximately a 35% speed boost, while the 31B model sees a 53% increase; output quality is reported as identical because the main model verifies drafted tokens before accepting them. These releases rely on quantization-aware training (QAT), making Q4_K_M the optimal format, as higher precision offers no quality gains for these specific models. The 26B-A4B is a Mixture of Experts architecture with roughly 4 billion active parameters per token, whereas the 31B variant is a dense model offering higher capability for users with sufficient VRAM. Both models include vision support via mmproj files and maintain a 262K context window. The author notes that GenRM testing resulted in zero refusals across 465 prompts, confirming their uncensored nature.

media r/LocalLLaMA · 1d ago

GLM-5.2 on 4x DGX Spark: Reconstructing Missing Build Steps for MTP Speculative Decode

The author successfully deployed GLM-5.2 with MTP speculative decode on a cluster of four NVIDIA GB10 (DGX Spark) nodes, achieving approximately 9.4 tokens per second. This setup utilizes vLLM with tensor parallelism, ported sparse-MLA Triton kernels, and a deterministic 15% expert pruning to fit AWQ-INT4 weights. A critical finding is that the original Docker image build instructions are incomplete, requiring reconstruction of missing patches for deep_gemm.py and sparse_attn_indexer.py. The author also found that with any vLLM version other than the specific pinned commit, loading real AWQ weights crashes with CUDA errors. To replicate the environment, users must apply a custom script that bakes in kernels and routes functions to sm12x fallbacks. The setup runs at roughly double the speed of previous llama.cpp implementations, though inter-node bandwidth remains a bottleneck for dual-rail scaling.

media r/LocalLLaMA · 1d ago

Gefen: A Drop-in Replacement for AdamW with Claimed 8x Memory Reduction

Gefen is presented as a drop-in replacement for the AdamW optimizer, claiming an eightfold reduction in memory usage during training. The project includes a GitHub repository available at ndvbd/Gefen and a corresponding research paper hosted on arXiv under the identifier 2606.13894. This submission highlights Gefen's potential to optimize resource efficiency for machine learning workflows. The provided source material links directly to the technical documentation and codebase for further verification. No additional performance metrics or comparative benchmarks are detailed in the available text.

media Hugging Face Forums · 1d ago

Qwen3/Gemma3 Candle Skips Attention Masks for Equal-Length Batches in CPU Mode

A user has reported a critical bug in the Hugging Face text-embeddings-inference library affecting Qwen3 and Gemma3 models. The issue arises when running inference on CPUs with concurrent requests, leading to significant accuracy degradation. Specifically, the Candle backend incorrectly skips attention masks for batches where all input sequences have equal lengths. This defect compromises the reliability of embeddings generated under these specific conditions. To address the problem, the author submitted a pull request containing a fix that was thoroughly tested on their local machines. The bug highlights potential stability risks in CPU-based embedding services handling batched inputs.