Inference efficiency
arxiv arXiv cs.AI · 1d ago

Empirical Study of OpenPangu Quantization on Ascend NPUs

A controlled study evaluates the OpenPangu 1B and 7B models on Huawei Ascend 910B1 NPUs under weight-only and weight-activation quantization. Results show that 8-bit weight-only quantization is lossless for both models, while 4-bit quantization is practical for the 7B model but harmful for the 1B model on reasoning, math, and code tasks. Ultra-low-precision settings such as 2-bit and binary quantization fail, and W4A4 SmoothQuant produces non-finite perplexity, indicating that extreme low-bit compression remains challenging.
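For readers unfamiliar with the term, the following is a minimal sketch of weight-only quantization using symmetric per-tensor round-to-nearest. It is a generic illustration, not the method used in the study, and the weight values and function names are made up for the example. It shows why the reconstruction error grows as the bit width shrinks.

```python
def quantize_weights(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization.

    Returns the integer codes and the scale needed to restore them.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    codes, scale = quantize_weights(weights, bits)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.53, 0.97, -0.08, 0.31, -0.76]
err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit error: {err8:.5f}, 4-bit error: {err4:.5f}")
```

Activations stay in higher precision in this scheme; weight-activation settings such as W4A4 quantize both sides, which is where outlier activations tend to cause trouble.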

arxiv arXiv cs.AI · 1d ago

MoE Models Show Device-Dependent Inference Performance

An empirical study finds that Mixture-of-Experts models do not consistently outperform dense models on consumer or edge hardware. On the Apple M2 Pro, OLMoE-1B-7B is only 10% slower than a comparable dense model, while on the NVIDIA Jetson Orin Nano, it is 31% slower with 2.1 times higher energy per token, due to memory and KV-cache constraints. The results indicate that sparse activation benefits are limited by total-parameter memory footprint, especially on bandwidth-bound edge devices.
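The footprint argument can be made concrete with back-of-envelope arithmetic. The sketch below assumes round parameter counts read off the model name (about 1B active and 7B total) and 2 bytes per parameter (16-bit weights). Both are assumptions for illustration, not figures from the study.

```python
def weight_gib(params, bytes_per_param=2):
    """Approximate weight memory in GiB for a given parameter count."""
    return params * bytes_per_param / 2**30


# Assumed round numbers taken from the "1B-7B" naming convention.
total_params = 7e9   # every expert must stay resident in memory
active_params = 1e9  # parameters actually used per token

resident_gib = weight_gib(total_params)
active_gib = weight_gib(active_params)

print(f"resident weights: {resident_gib:.1f} GiB")
print(f"active per token: {active_gib:.1f} GiB")
print(f"resident/active ratio: {resident_gib / active_gib:.1f}x")
```

Under these assumptions the device must hold roughly seven times more weight data than any single token touches, which is why a memory-constrained board can lose the compute savings that sparse activation promises.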

media r/LocalLLaMA · 1d ago

New Qwen-27B IQ4_KS and IQ4_KS_KT Quantizations for ik_llama.cpp

Two new GGUF quantizations for Qwen-27B have been released for ik_llama.cpp, optimized for 16GB of VRAM on NVIDIA GPUs. The first, Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf, improves logical reasoning at the cost of general knowledge, with a perplexity of 7.4131. The second, Qwen3.6-27B.i1-IQ4_KS_KT-attn_qkv-IQ4_KS.gguf, applies Trellis quantization (iq4_kt) selectively to tensors with near-Gaussian distributions and achieves a perplexity of 7.4091 (0.004 below the first file), which suggests minimal performance degradation.

media r/LocalLLaMA · 2d ago

OpenRouter model prices imply heavier quantization

OpenRouter's model pricing suggests that providers are serving heavily quantized models, since raw inference costs would exceed the listed API prices without high throughput or optimized serving. The author argues that unless providers achieve much better efficiency or offer premium, high-fidelity access, quantization likely degrades output quality, especially in complex tasks like planning and coding. This raises concerns about transparency and access to true model capability.

media r/LocalLLaMA · 2d ago

KLD Analysis of KV Cache Quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

A detailed analysis maps the KLD (Kullback-Leibler divergence) of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B models. Results show q8/q8 quantization is nearly lossless on both models, while q4/q4 performs well on Qwen but causes severe degradation on Gemma. Turbo quantization variants show mixed performance, with turbo3 and turbo2 enabling extreme cache compression at significant accuracy cost.
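As a rough illustration of the metric, the sketch below computes the KL divergence between the next-token distribution of a reference run and that of a run with a quantized cache. The logits are invented for the example and are not data from the post. Real measurements average this quantity over many token positions.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical logits for one token position over a 3-token vocabulary.
ref_logits = [2.0, 1.0, 0.1]        # unquantized KV cache
mild_logits = [2.05, 0.98, 0.12]    # small perturbation
severe_logits = [1.0, 2.0, 0.1]     # top-two tokens swapped

p_ref = softmax(ref_logits)
kld_mild = kl_divergence(p_ref, softmax(mild_logits))
kld_severe = kl_divergence(p_ref, softmax(severe_logits))

print(f"mild: {kld_mild:.6f} nats, severe: {kld_severe:.6f} nats")
```

A value near zero means the quantized cache leaves the output distribution almost unchanged, which is the sense in which a setting is called nearly lossless. Larger values indicate the model's predictions have shifted.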