Inference efficiency — korshunov.ai


GLM 5.2 Local Inference Speeds Report

Users running GLM 5.2 locally with llama.cpp on 6x RTX 3090, 128GB DDR5, and an i7-13700K report generation speeds of 7.8 tokens/sec at a 90K context size with Q8_0 quantization. Prompt processing runs at approximately 40 tokens/sec.
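As a rough illustration of what these rates mean in wall-clock terms, the sketch below converts the reported figures into time. The helper name and the 1,000-token reply length are illustrative assumptions, and the calculation assumes the reported rates stay constant across the whole context, which real runs may not.

```python
def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Time needed to process `tokens` at a constant rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return tokens / tokens_per_sec

# Figures taken from the report above.
PROMPT_RATE = 40.0   # prompt processing, tokens/sec
GEN_RATE = 7.8       # generation, tokens/sec
CONTEXT = 90_000     # context size, tokens

prefill_s = seconds_for(CONTEXT, PROMPT_RATE)  # filling the full context from scratch
reply_s = seconds_for(1_000, GEN_RATE)         # hypothetical 1,000-token reply

print(f"prefill: {prefill_s / 60:.1f} min, reply: {reply_s / 60:.1f} min")
```

Under these assumptions, prompt processing dominates: a cold 90K-token prompt takes far longer than generating the reply itself.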

github llama.cpp · 4d ago

llama.cpp Release b9741 Adds New Binaries and Support

llama.cpp version b9741 introduces new binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes support for Vulkan, CUDA 12.4 and 13.3, OpenVINO, SYCL, and ROCm, with updated versions for iOS and Ubuntu.

media r/LocalLLaMA · 4d ago

Free 15-Part Series on LLM Internals Grounded in Gemma 4 12B

I wrote a free 15-part series detailing LLM internals, using Gemma 4 12B as the core example. Each part covers technical aspects from tokenization to serving, with real math, tensor shapes, and hardware constraints. The series includes a companion vLLM Deep Dive and is fully accessible without paywalls or email.

github llama.cpp · 5d ago

Fix for test-args-parser random failures on Windows

A patch addresses random failures in the test-args-parser on Windows by modifying argv override to only apply when argc matches, preventing clobbering of programmatic arguments. This fixes a fastfail assertion in the OpenVINO Windows workflow while preserving UTF-8 handling for real binaries.
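The guard described in the patch can be sketched as follows. This is a Python illustration of the idea only, not the llama.cpp code (which is C++); `resolve_argv` and its parameters are hypothetical names.

```python
from typing import List

def resolve_argv(argc: int, argv: List[str], os_argv: List[str]) -> List[str]:
    """Pick which argument list to parse.

    `argv` is what the caller passed in; `os_argv` is the UTF-8 command line
    re-read from the operating system. The OS copy is used only when its
    length matches `argc`, i.e. when the caller plausibly passed the real
    process arguments. A mismatch means the arguments were built
    programmatically (as a test does), so they are left untouched.
    """
    if len(os_argv) == argc:
        return os_argv
    return argv[:argc]

# A real binary: counts match, so the UTF-8 copy wins over the mangled one.
real = resolve_argv(2, ["app", "?"], ["app", "модель.gguf"])
# A test harness: 3 synthetic arguments, but the process was started with 1.
synthetic = resolve_argv(3, ["test", "-m", "x.gguf"], ["test-args-parser"])
```

The length check is a heuristic: it keeps UTF-8 handling for ordinary launches while leaving arguments built by a test alone.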

media r/LocalLLaMA · 5d ago

You can now convert EXL3 quants on Apple Silicon Mac

Users can now convert and run EXL3 quantized models on Apple Silicon Macs with 64GB+ RAM. Tests show that models like MiniCPM5 and Qwen3.6-27B achieve performance on par with or slightly behind RTX-card-based conversions, with EXL3 offering superior quantization quality compared to MLX.

media r/LocalLLaMA · 5d ago

Napkin math on collective hosting costs for diffusiongemma in 2026

A cost analysis estimates that hosting diffusiongemma at different user token levels results in monthly costs per user ranging from 1.7€ to 122.8€. The study finds agentic AI usage is economically unsustainable for collective hosting, though costs could decrease with new GPUs or ASICs and a shorter GPU depreciation period.

github llama.cpp · 5d ago

llama.cpp release b9738: fixes CORS auth header forwarding and new binary builds

llama.cpp version b9738 fixes the CORS proxy to avoid forwarding authentication headers. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.
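The idea behind the fix can be sketched as a header filter applied before a request is forwarded. This is a minimal Python illustration, not the llama.cpp implementation; the function name and the exact set of headers treated as sensitive are assumptions.

```python
from typing import Dict

# Assumed set of credential-bearing headers; the list used by llama.cpp may differ.
SENSITIVE_HEADERS = frozenset({"authorization", "proxy-authorization", "cookie"})

def strip_auth_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `headers` without credential-bearing entries.

    Header names are compared case-insensitively, since HTTP header
    names are case-insensitive.
    """
    return {k: v for k, v in headers.items() if k.lower() not in SENSITIVE_HEADERS}

incoming = {
    "Authorization": "Bearer secret",
    "Accept": "application/json",
    "cookie": "sid=1",
}
forwarded = strip_auth_headers(incoming)  # only the Accept header survives
```

Returning a filtered copy rather than mutating the input keeps the original request intact for local authentication checks.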

github llama.cpp · 5d ago

ggml optimizes AMX with partition flattening

The ggml project has optimized AMX performance by flattening the partition over n_batch * M, ensuring all threads participate in quantization. This change improves speed by up to 1.47x across various models and hardware configurations on CPU and GPU platforms, with results showing consistent gains in inference time.
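To see why flattening the partition can keep more threads busy, consider how rows are handed out. The sketch below is a simplified Python model, not ggml code: the sizes are illustrative, and the "per-batch" baseline is an assumed comparison point rather than a description of the previous implementation.

```python
from typing import List, Tuple

def partition(total: int, n_threads: int) -> List[Tuple[int, int]]:
    """Split [0, total) into n_threads contiguous half-open ranges, as evenly as possible."""
    base, extra = divmod(total, n_threads)
    ranges, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def busy_threads(ranges: List[Tuple[int, int]]) -> int:
    """Count threads that received at least one row."""
    return sum(1 for lo, hi in ranges if hi > lo)

n_batch, M, n_threads = 8, 4, 16  # illustrative sizes

# Assumed baseline: each batch element's M rows are partitioned on their own.
per_batch = busy_threads(partition(M, n_threads))
# Flattened: a single partition over all n_batch * M rows.
flattened = busy_threads(partition(n_batch * M, n_threads))

print(per_batch, flattened)
```

When M is smaller than the thread count, the baseline leaves threads without work, while the flattened partition gives every thread a share.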

github llama.cpp · 5d ago

GLM-5.2 DSA indexer fix: tensors marked not required

The GLM-5.2 model's DSA indexer was incorrectly loaded on all layers, causing failures due to missing tensors. The update marks indexer tensors as TENSOR_NOT_REQUIRED, allowing layers without an indexer to load as nullptr and enabling full MLA attention. DeepSeek-V3.2, with uniform indexing, is unaffected.
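The loading behavior described above can be modeled in a few lines. This is a toy Python sketch, not the llama.cpp loader: the flag value, the `get_tensor` helper, and the tensor names are hypothetical and only mirror the idea of an optional tensor that loads as a null value.

```python
from typing import Dict, List, Optional

TENSOR_NOT_REQUIRED = 1  # hypothetical flag value, named after the flag in the summary

def get_tensor(weights: Dict[str, List[float]], name: str, flags: int = 0) -> Optional[List[float]]:
    """Look up a tensor by name; a missing tensor raises unless marked optional."""
    if name in weights:
        return weights[name]
    if flags & TENSOR_NOT_REQUIRED:
        return None  # plays the role of nullptr
    raise KeyError(f"missing tensor: {name}")

# Toy checkpoint: layer 0 has an indexer, layer 1 does not (names are made up).
weights = {"blk.0.indexer.weight": [0.1, 0.2]}
indexers = [
    get_tensor(weights, f"blk.{i}.indexer.weight", TENSOR_NOT_REQUIRED)
    for i in range(2)
]
# Layers without an indexer would fall back to full attention.
uses_indexer = [t is not None for t in indexers]
```

Without the flag, the same lookup for layer 1 raises, which corresponds to the load failure the update addresses.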

media r/LocalLLaMA · 5d ago

Best Settings for 48GB VRAM with Qwen 3.6 27B

A user shares optimized settings for running Qwen 3.6 27B with Q8_0 quantization on an RTX 4090 and RTX 3090 setup using llama.cpp. The configuration includes tensor split, 999 layers on GPU, 250k context, speculative decoding, and unified KV cache, achieving 75-100 t/s throughput with vision and MTP support.

media r/LocalLLaMA · 5d ago

7900XTX 24GB VRAM Runs Qwen 3.6 27B with 131k Context

A user reports successfully running a Qwen 3.6 27B model with Q6K+MTP quantization and 131k context length on a 7900XTX with 24GB VRAM. This is achieved using kvcache quantization (Q5_0/Q4_0), which reduces VRAM usage by 12% compared to Q8, enabling the model to run at 55-60 tokens per second with specific compile flags and llama.cpp arguments.
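The effect of KV cache quantization on the cache itself can be estimated from the ggml block layouts, where each block covers 32 values and stores an fp16 scale plus packed integers. The sketch below assumes that "Q5_0/Q4_0" means K in Q5_0 and V in Q4_0, and that K and V hold the same number of values; both are assumptions, not statements from the post.

```python
# Bytes per 32-value block for each ggml quantization format.
BLOCK_BYTES = {"Q8_0": 34, "Q5_0": 22, "Q4_0": 18}
BLOCK_VALUES = 32

def bits_per_value(fmt: str) -> float:
    """Effective storage cost per cached value, including the per-block scale."""
    return BLOCK_BYTES[fmt] * 8 / BLOCK_VALUES

def kv_bits(k_fmt: str, v_fmt: str) -> float:
    """Average bits per cached value when K and V hold equally many values."""
    return (bits_per_value(k_fmt) + bits_per_value(v_fmt)) / 2

baseline = kv_bits("Q8_0", "Q8_0")  # both caches in Q8_0
mixed = kv_bits("Q5_0", "Q4_0")     # assumed split: K in Q5_0, V in Q4_0
cache_saving = 1 - mixed / baseline

print(f"KV cache shrinks by about {cache_saving:.0%}")
```

This covers the cache alone. The 12% figure in the post presumably refers to total VRAM, which also includes the model weights and compute buffers that cache quantization does not shrink.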

media r/LocalLLaMA · 5d ago

GLM 5.2 Achieves 98% Max Intelligence with Less Than Half Tokens

GLM 5.2 demonstrates 98% of maximum intelligence in coding tasks using less than half of its total token budget, according to a technical report by z_ai. The model's reasoning efficiency has improved significantly, with token usage increasing from 16.7k to 36.7k between GLM 5.1 and GLM 5.2, though high-level settings may strain local hardware performance.

media r/LocalLLaMA · 5d ago

llama.cpp B70 SYCL Benchmarks Results

Benchmarks show llama.cpp B70 with SYCL backend performs well on models like gemma4 12B and 26B, achieving throughput of up to 5662.45 t/s for the E2B model. Performance drops significantly in tg128 mode, with qwen35 27B reaching only 15.42 t/s, indicating room for optimization.

media r/LocalLLaMA · 5d ago

Local agent on 4090 - looking for LM Studio settings

A user reports slow token generation when running a local agent on a 4090 with 24GB VRAM, despite adjusting context and batching settings. They note Gemma4 performs faster but produces incorrect tokens like </tool_call>, and seek recommended settings and explanations for parameters such as top_p and top_k.

media r/LocalLLaMA · 5d ago

RTX 5090 MSI Power Usage and Cable Warning

A user reports that an MSI RTX 5090 draws 475-500W during inference or diffusion training. They have had no issues with the power cable, and emphasize that it should not be bent, to keep operation safe and stable.

github llama.cpp · 5d ago

ggml-webgpu Adds F16 Adapter Toggles for Vulkan and NVIDIA

The ggml-webgpu project has added adapter toggles for half-precision (F16) support on Vulkan and NVIDIA GPUs. This update enables improved performance on compatible hardware across multiple platforms, including macOS, Linux, Android, Windows, and openEuler, with specific builds available for ARM and x64 architectures.

media r/LocalLLaMA · 5d ago

$1800 GPU cost runs Qwen3.6-27B with 262K context and 55 tok/s

A setup using four 5060 Ti GPUs (totaling $1800) achieves 55 tokens per second with Qwen3.6-27B-FP8, supporting 262K context length and a bfloat16 KV cache. The configuration uses P2P and FlashInfer, with benchmark results showing an output token throughput of 55.67 tok/s and a 65.25% speculative decoding acceptance rate.

github llama.cpp · 5d ago

llama.cpp Release b9731: Performance Optimization and Cross-Platform Binaries

llama.cpp version b9731 uses std::partial_sort to reduce token sorting overhead, cutting top-n token selection time from 8.555ms to 0.704ms. The release includes prebuilt binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options.
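The reason partial sorting helps is that top-n selection only needs the n largest entries in order, not the whole vocabulary. The sketch below is a Python analogue using `heapq.nlargest`, not the C++ change itself; the function names, vocabulary size, and n are made up for illustration.

```python
import heapq
import random

def top_n_full_sort(logits, n):
    """Baseline: sort every (logit, token_id) pair, then keep the first n."""
    return sorted(((v, i) for i, v in enumerate(logits)), reverse=True)[:n]

def top_n_partial(logits, n):
    """Partial selection: only the n largest pairs are ordered."""
    return heapq.nlargest(n, ((v, i) for i, v in enumerate(logits)))

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(50_000)]  # made-up vocabulary size

# Both approaches return the same candidates in the same order.
assert top_n_full_sort(logits, 40) == top_n_partial(logits, 40)

# Ratio of the two timings quoted in the release summary.
speedup = 8.555 / 0.704
print(f"{speedup:.1f}x")
```

The quoted timings work out to roughly a 12x reduction for this step. How much a Python version gains will differ, since the constant factors are not comparable to C++.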

media r/LocalLLaMA · 5d ago

Help Running Local Hermes Agent with llama-cpp

A user reports issues running a local Hermes AI agent on a high-end rig using self-compiled llama-cpp. The setup experiences frequent KV cache reprocessing every 5 messages and slow reasoning, with the agent repeatedly pausing to report progress instead of continuing autonomously. The user seeks guidance on whether their llama-cpp parameters are incorrect or what adjustments can improve agent performance and sustained reasoning without interruptions.

media r/LocalLLaMA · 5d ago

Maximizing Performance of 2x3090 with NVLink

A user reports achieving only 60 tokens per second in short bursts and an average of 40-45 tokens per second when running Qwen 3.6 27B with Q8_0 quantization on two GeForce RTX 3090 GPUs connected via NVLink. The setup includes Ubuntu 24.04, a Ryzen 7950X3D, and 64GB DDR5, with display routed through an eGPU.