Inference efficiency — korshunov.ai

llama.cpp release b9738: CORS auth header forwarding fix and new binary builds

llama.cpp version b9738 fixes the CORS proxy to avoid forwarding authentication headers. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.
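As a rough illustration of the idea, and not the actual llama.cpp patch, a proxy can drop credential-bearing headers before it forwards a request upstream. The header set and the function name below are assumptions chosen for this sketch.

```python
# Hypothetical sketch: strip credential-bearing headers before a proxy
# forwards a request. The header list is an assumption, not llama.cpp's.
SENSITIVE_HEADERS = {"authorization", "proxy-authorization", "cookie"}


def strip_auth_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of headers without credential-bearing entries."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in SENSITIVE_HEADERS  # header names are case-insensitive
    }


incoming = {
    "Authorization": "Bearer secret-token",
    "Content-Type": "application/json",
    "Accept": "*/*",
}
forwarded = strip_auth_headers(incoming)
print(forwarded)
```

Filtering on a lowercase name keeps the check independent of how the client capitalized the header.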

github llama.cpp · 6d ago

ggml optimizes AMX with partition flattening

The ggml project has optimized its AMX path by flattening the work partition over n_batch * M, so that all threads participate in quantization. The change is reported to improve speed by up to 1.47x across various models and hardware configurations. AMX is a CPU matrix extension, so the gains apply to CPU inference.
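The effect of flattening can be sketched with a toy work splitter. This is a simplified model and not the ggml code: the sizes are illustrative, and the baseline of partitioning over a single dimension is an assumption made for comparison.

```python
def split_range(total: int, n_threads: int) -> list[range]:
    """Split [0, total) into n_threads near-equal contiguous chunks."""
    base, extra = divmod(total, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


n_batch, M, n_threads = 4, 64, 8  # illustrative sizes only

# Assumed baseline: partition over one dimension (here n_batch) only.
per_dim = split_range(n_batch, n_threads)
# Flattened: partition over the combined n_batch * M index space.
flat = split_range(n_batch * M, n_threads)

busy_per_dim = sum(1 for chunk in per_dim if len(chunk) > 0)
busy_flat = sum(1 for chunk in flat if len(chunk) > 0)
print(busy_per_dim, busy_flat)

# A flat index maps back to (batch, row) with divmod.
batch_idx, row_idx = divmod(flat[3][0], M)
```

With fewer batches than threads, the single-dimension split leaves some threads idle, while the flattened split gives every thread a share of the work.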

github llama.cpp · 6d ago

GLM-5.2 DSA indexer fix: tensors marked not required

The GLM-5.2 model's DSA indexer was incorrectly loaded on all layers, causing failures due to missing tensors. The update marks indexer tensors as TENSOR_NOT_REQUIRED, allowing layers without an indexer to load as nullptr and enabling full MLA attention. DeepSeek-V3.2, with uniform indexing, is unaffected.
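The loading behavior described here can be sketched as an optional-tensor lookup. Only the flag name TENSOR_NOT_REQUIRED comes from the summary; the function, the flag value, and the tensor names are hypothetical and do not mirror the llama.cpp loader.

```python
TENSOR_NOT_REQUIRED = 1  # flag name from the summary; the value is arbitrary here


def load_tensor(available: dict, name: str, flags: int = 0):
    """Return the tensor, or None if it is missing but marked optional."""
    if name in available:
        return available[name]
    if flags & TENSOR_NOT_REQUIRED:
        return None  # stands in for nullptr
    raise KeyError(f"missing required tensor: {name}")


# Hypothetical checkpoint: only layer 0 ships an indexer tensor.
weights = {"blk.0.indexer.weight": "idx0"}

indexers = [
    load_tensor(weights, f"blk.{layer}.indexer.weight", TENSOR_NOT_REQUIRED)
    for layer in range(2)
]
print(indexers)
```

Without the flag, the same lookup for layer 1 raises an error, which corresponds to the load failure the fix addresses.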

media r/LocalLLaMA · 6d ago

Best Settings for 48GB VRAM with Qwen 3.6 27B

A user shares settings for running Qwen 3.6 27B at Q8_0 quantization across an RTX 4090 and an RTX 3090 with llama.cpp. The configuration uses a tensor split, a GPU layer count of 999 (in effect offloading every layer), a 250k context, speculative decoding, and a unified KV cache, and reaches 75-100 t/s with vision and MTP support.

media r/LocalLLaMA · 6d ago

7900XTX 24GB VRAM Runs Qwen 3.6 27B with 131k Context

A user reports running Qwen 3.6 27B at Q6_K quantization with MTP and a 131k context on a 7900XTX with 24GB of VRAM. The key is KV cache quantization (Q5_0/Q4_0), which reduces VRAM usage by 12% compared to Q8 and lets the model run at 55-60 tokens per second with specific compile flags and llama.cpp arguments.
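A back-of-the-envelope calculation shows why the KV cache type matters at long context. The bits-per-element values follow from the ggml block layouts (32 values plus a 16-bit scale per block), but the model dimensions are hypothetical placeholders and not Qwen 3.6 27B's real shape. Reading Q5_0/Q4_0 as the K and V types respectively is also an assumption. The sketch covers the cache only, so it does not reproduce the 12% total-VRAM figure.

```python
# Effective bits per element, including the per-block scale.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q5_0": 5.5, "q4_0": 4.5}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, k_type, v_type):
    """Approximate KV cache size in bytes for the given K and V types."""
    elems = n_ctx * n_layers * n_kv_heads * head_dim  # elements in K (same in V)
    bits = elems * (BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type])
    return bits / 8


# Hypothetical dimensions, for illustration only.
dims = dict(n_ctx=131072, n_layers=16, n_kv_heads=4, head_dim=256)

q8 = kv_cache_bytes(**dims, k_type="q8_0", v_type="q8_0")
mixed = kv_cache_bytes(**dims, k_type="q5_0", v_type="q4_0")
ratio = mixed / q8
print(f"q8_0/q8_0: {q8 / 2**30:.2f} GiB, q5_0/q4_0: {mixed / 2**30:.2f} GiB")
```

The ratio between the two cache sizes depends only on the element types, not on the placeholder dimensions.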

media r/LocalLLaMA · 6d ago

GLM 5.2 Achieves 98% Max Intelligence with Less Than Half the Tokens

GLM 5.2 demonstrates 98% of maximum intelligence in coding tasks using less than half of its total token budget, according to a technical report by z_ai. The model's reasoning efficiency has improved significantly, with token usage increasing from 16.7k to 36.7k between GLM 5.1 and GLM 5.2, though high-level settings may strain local hardware performance.

media r/LocalLLaMA · 6d ago

llama.cpp B70 SYCL Benchmark Results

Benchmarks of llama.cpp on B70 with the SYCL backend show strong results on models such as gemma4 12B and 26B, with throughput of up to 5662.45 t/s for the E2B model. Throughput drops sharply in the tg128 test (generation of 128 tokens), where qwen35 27B reaches only 15.42 t/s, indicating room for optimization.

media r/LocalLLaMA · 6d ago

Local agent on 4090 - looking for LM Studio settings

A user reports slow token generation when running a local agent on a 4090 with 24GB VRAM, despite adjusting context and batching settings. They note Gemma4 performs faster but produces incorrect tokens like </tool_call>, and seek recommended settings and explanations for parameters such as top_p and top_k.

media r/LocalLLaMA · 6d ago

RTX 5090 MSI Power Usage and Cable Warning

The MSI RTX 5090 draws 475-500W during inference or diffusion training. The user reports no issues with the power cable and stresses that it should not be bent, to keep operation safe and stable.

github llama.cpp · 6d ago

ggml-webgpu Adds F16 Adapter Toggles for Vulkan and NVIDIA

The ggml-webgpu project has added adapter toggles for half-precision (F16) support on Vulkan and NVIDIA GPUs. This update enables improved performance on compatible hardware across multiple platforms, including macOS, Linux, Android, Windows, and openEuler, with specific builds available for ARM and x64 architectures.

media r/LocalLLaMA · 6d ago

$1800 GPU cost runs Qwen3.6-27B with 262K context and 55 tok/s

A setup using four RTX 5060 Ti GPUs (about $1800 in total) reaches 55 tokens per second with Qwen3.6-27B-FP8, a 262K context length, and a bfloat16 KV cache. The configuration uses P2P and FlashInfer; the benchmark reports an output throughput of 55.67 tokens per second and a speculative decoding acceptance rate of 65.25%.

github llama.cpp · 6d ago

llama.cpp Release b9731: Performance Optimization and Cross-Platform Binaries

llama.cpp version b9731 introduces an optimization that uses std::partial_sort to reduce token sorting overhead, cutting top-n token selection time from 8.555 ms to 0.704 ms. The release includes prebuilt binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options.
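The gain comes from ordering only the n best candidates instead of sorting the whole vocabulary. The sketch below is a Python analogue that uses heapq.nlargest in place of std::partial_sort; it is not the llama.cpp code. The function names and sizes are made up for illustration, and the timings quoted above are not reproduced.

```python
import heapq
import random


def top_n_full_sort(logits, n):
    """Sort every (token_id, logit) pair, then keep the first n."""
    return sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)[:n]


def top_n_partial(logits, n):
    """Select only the n largest pairs without sorting the rest."""
    return heapq.nlargest(n, enumerate(logits), key=lambda pair: pair[1])


rng = random.Random(0)
logits = [rng.random() for _ in range(10_000)]  # stand-in for a vocabulary

full = top_n_full_sort(logits, 40)
partial = top_n_partial(logits, 40)
print(partial[0])
```

Both functions return the same n pairs in the same order; the partial version avoids ordering the tail of the list that the sampler never looks at.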

media r/LocalLLaMA · 6d ago

Help Running Local Hermes Agent with llama.cpp

A user reports issues running a local Hermes AI agent on a high-end rig with a self-compiled llama.cpp. The KV cache is reprocessed roughly every 5 messages, reasoning is slow, and the agent repeatedly pauses to report progress instead of continuing autonomously. The user asks whether their llama.cpp parameters are wrong and which adjustments would improve agent performance and allow sustained reasoning without interruptions.

media r/LocalLLaMA · 6d ago

Maximizing Performance of 2x3090 with NVLink

A user reports reaching only 60 tokens per second in short bursts, and an average of 40-45 tokens per second, when running Qwen 3.6 27B at Q8_0 quantization on two GeForce RTX 3090 GPUs connected via NVLink. The setup includes Ubuntu 24.04, a Ryzen 7950X3D, and 64GB of DDR5, with the display routed through an eGPU.

github llama.cpp · 6d ago

llama.cpp Release b9729: New Binaries and Platform Support

llama.cpp version b9729 ships binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes CPU, Vulkan, OpenVINO, SYCL, and ROCm support, along with a new UI package. Internal references to 'webui' have been removed.

media r/LocalLLaMA · 6d ago

How to Set Optimal llama.cpp Parameters for AMD GPU

Users looking for optimal llama.cpp settings for Gemma 4 models on an AMD GPU with 16GB VRAM ask whether trial and error is unavoidable. They cite Google's default settings for temperature, top-p, and top-k but report inconsistent results, which points to a need for more targeted guidance than the official documentation provides.

media r/LocalLLaMA · 6d ago

Fixing Long-Context Decode Cliff on Radeon R9700 with vLLM 0.22.1

A long-context decode performance cliff on AMD Radeon AI PRO R9700 (RDNA4) was resolved by enabling AITER Unified Attention in vLLM 0.22.1. The fix involves relaxing a CDNA gate to include RDNA4, disabling other attention backends, and using bf16 KV cache, resulting in significant speedups across all context lengths. FP8 KV is ineffective on this hardware, and the model's native 262K context is fully achievable with bf16, offering ~2.9× concurrency without needing FP8.

github llama.cpp · 6d ago

llama.cpp Release b9728 Adds Comment Line Support and Multiple Platform Binaries

llama.cpp version b9728 adds support for comment lines in the file passed to --api-key-file. The release includes prebuilt binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.
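A key-file reader that tolerates comments might look like the sketch below. The summary does not say which comment syntax llama.cpp accepts, so the "#" prefix, the blank-line handling, and the function name are all assumptions made for illustration.

```python
def parse_api_keys(text: str, comment_prefix: str = "#") -> list[str]:
    """Return one key per non-empty line, skipping comment lines.

    The comment prefix is an assumption, not llama.cpp's documented syntax.
    """
    keys = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue  # ignore blank lines and comments
        keys.append(stripped)
    return keys


sample = """\
# keys for the staging server
key-alpha

# rotated monthly
key-beta
"""
keys = parse_api_keys(sample)
print(keys)
```

Comment support of this kind lets an operator annotate which key belongs to which client without the annotation being treated as a valid key.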

media r/LocalLLaMA · 6d ago

Best Harness for Web Searching

Users report that tools like LM Studio and Odysseus are limited by search engine request caps, often at 10 per day or hour, without API access. They suggest creating DuckDuckGo API accounts for better search access, but note that frontends rarely prompt for this. The post asks whether Hermes or Pi offer improved solutions.

media r/LocalLLaMA · 6d ago

Is My CPU and RAM Too Weak for Local LLMs?

A user reports that their CPU and RAM reach 100% during simple test prompts while the GPU is underutilized. They ask whether their RTX 3050 8GB GPU can run Qwen3.5:9b locally, noting that in theory it should be feasible.