Inference efficiency — korshunov.ai


100 t/s on Qwen3.6-27B Q8 across 5090 + 3090 Ti with tensor split-mode

A user achieved 100 tokens per second on Qwen3.6-27B at Q8_0 using two GPUs (RTX 5090 and RTX 3090 Ti). Switching from layer split to tensor split mode increased throughput from 70 to 100 t/s, with a 70/30 tensor split favoring the 5090 to match compute power. Throughput varies by prompt, reaching up to 130 t/s in some cases.
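As a rough illustration of the figures above, the following Python sketch normalizes a 70/30 split into per-GPU fractions and computes the reported speedup. The function names are hypothetical and are not llama.cpp options; the sketch only restates the arithmetic from the summary.

```python
def normalize_split(weights):
    """Turn relative weights such as [70, 30] into fractions that sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [w / total for w in weights]


def speedup(before_tps, after_tps):
    """Ratio of new throughput to old throughput."""
    return after_tps / before_tps


# Figures from the summary: 70/30 split, 70 t/s (layer) to 100 t/s (tensor).
fractions = normalize_split([70, 30])  # share for the 5090, share for the 3090 Ti
gain = speedup(70, 100)                # roughly 1.43x
print(fractions, round(gain, 2))
```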

media r/LocalLLaMA · 2d ago

Who needs GPUs? 64 t/s gen, 285 PP on 6-year-old CPUs

A gemma-4-26B-A4B model running CPU-only on two Xeon 6248R processors achieves 64 t/s generation and 285 t/s prompt processing (PP), demonstrating viable performance on 6-year-old hardware. The user highlights the potential for CPU-optimized local LLMs to rival GPU-based systems, emphasizing cost efficiency and accessibility.

github llama.cpp · 2d ago

llama.cpp release b9767 adds GPU and multi-platform support

llama.cpp release b9767 improves MTP inference using mat-vec paths for small batches and includes updated GPU support. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and APIs including Vulkan, CUDA, OpenVINO, and SYCL.

media r/LocalLLaMA · 2d ago

MCP servers consume context window via tool definitions

Each MCP server dumps its full tool list into the model's context before any prompt, using up to 24,000 tokens for 62 tools. A local gateway implementing lazy discovery reduces tool-definition overhead by 97%, cutting token usage from ~24k to ~660 per request, with 90% fewer total tokens over a task, without affecting task success rate.
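The lazy-discovery idea can be sketched in Python as follows. This is a minimal illustration, not the gateway described in the post: the catalog, the tool names, and the two meta-tools (`search_tools`, `describe_tool`) are all hypothetical. The last lines check the reduction implied by the quoted ~24k and ~660 token figures.

```python
# Hypothetical tool catalog: names and schemas are made up for illustration.
CATALOG = {
    f"tool_{i}": {
        "description": f"Example tool number {i}",
        "parameters": {"type": "object", "properties": {"arg": {"type": "string"}}},
    }
    for i in range(62)
}


def eager_context():
    """Eager mode: every full tool definition is sent before the prompt."""
    return [{"name": name, **spec} for name, spec in CATALOG.items()]


def lazy_context():
    """Lazy mode: only two small meta-tools are sent up front."""
    return [
        {"name": "search_tools", "description": "Find tool names by keyword"},
        {"name": "describe_tool", "description": "Fetch one tool's full schema"},
    ]


def search_tools(keyword):
    """Return names of tools whose description mentions the keyword."""
    needle = keyword.lower()
    return [n for n, s in CATALOG.items() if needle in s["description"].lower()]


def describe_tool(name):
    """Return the full definition of one tool, loaded only when requested."""
    return {"name": name, **CATALOG[name]}


def reduction(before_tokens, after_tokens):
    """Fractional reduction in token count."""
    return 1 - after_tokens / before_tokens


# Figures quoted in the summary: ~24k tokens down to ~660 per request.
print(f"{reduction(24_000, 660):.2%}")
```

In this sketch the model first sees only the two meta-tools, calls `search_tools` to find candidates, and pulls a full schema with `describe_tool` only for the tool it intends to use.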

github llama.cpp · 2d ago

llama.cpp Release b9763 Adds ID to Tool Call Responses

llama.cpp version b9763 introduces an ID field in tool call responses. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, with a UI component also available.

media r/LocalLLaMA · 2d ago

Idea for running GLM2 at decent quant with GPU and DDR3 setup

A user proposes using four 5060 Ti GPUs with 64GB VRAM total, running at PCIe Gen 3, to run GLM2 at a reasonable quantization level. They suggest adding 512GB of DDR3 RAM in a server with 16 PCIe lanes and 4x4 bifurcation to offload KV cache storage, aiming for efficient inference without relying on unified memory clusters. The setup is estimated to cost around $1,700 in total.

lab NVIDIA Technical Blog · 3d ago

CCCL Runtime: A Modern C++ Runtime for CUDA

NVIDIA has released the CCCL Runtime, a modern C++ runtime that provides safer and more convenient abstractions for CUDA programming. It introduces updated C++ features to simplify and enhance CUDA C++ development.

lab NVIDIA Technical Blog · 3d ago

Enable Real-Time AI for High-Speed Data Acquisition with DAQIRI

NVIDIA's DAQIRI enables real-time AI processing for high-speed data acquisition by analyzing data as it is generated. For context, AlphaFold2's 2020 success relied on 170,000 protein structures from the Protein Data Bank.

media r/LocalLLaMA · 3d ago

GLM-5.2 UD-IQ1_M Speed Test on llama.cpp with 5090 and 3090 Ti

A speed test of GLM-5.2 quantized to UD-IQ1_M using llama.cpp shows 579 t/s prefill at 8k context and 324 t/s at 57k context. Decode speed remains steady at 10.6 t/s for over 580 tokens, dropping to 9.37 t/s at 60k context.
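To put these rates in wall-clock terms, the Python sketch below converts them into approximate waiting times. It assumes 8k and 57k mean 8,000 and 57,000 tokens and that the quoted rates hold for the whole prompt, which the post does not confirm; the function names are illustrative.

```python
def prefill_seconds(prompt_tokens, prefill_tps):
    """Approximate time to process a prompt at a constant prefill rate."""
    return prompt_tokens / prefill_tps


def decode_seconds(new_tokens, decode_tps):
    """Approximate time to generate tokens at a constant decode rate."""
    return new_tokens / decode_tps


# Rates from the summary; token counts are an assumption (1k = 1,000 tokens).
short_prefill = prefill_seconds(8_000, 579)   # about 14 seconds
long_prefill = prefill_seconds(57_000, 324)   # about 176 seconds
reply_time = decode_seconds(580, 10.6)        # about 55 seconds

print(round(short_prefill, 1), round(long_prefill, 1), round(reply_time, 1))
```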

media r/LocalLLaMA · 3d ago

Qwen3.6-35B-A3B APEX on RTX 3090: Speed and Quality Benchmarks

A benchmark compares llama.cpp forks (ik_llama and spiritbuun) running Qwen3.6-35B-A3B APEX with I-Compact and I-Quality models. ik_llama with I-Compact achieves the highest speed (~146 TPS), while spiritbuun with I-Quality and turbo8/turbo4 cache matches this speed and offers slightly better HellaSwag performance. turbo8/turbo4 KV caches outperform q8_0/q5_0, especially at longer contexts, with up to 15% speed gain and lower KLD, making them the better choice for both quality and long contexts.

media MarkTechPost · 3d ago

MoonMath AI Open-Sources HIP Attention Kernel That Beats AITER v3 on MI300X

MoonMath AI has open-sourced a bf16 forward attention kernel for AMD's MI300X GPU, written in HIP rather than assembly. It outperforms AMD's own AITER v3 kernel across all tested shapes and rounding modes, with speedups up to 1.26x, and maintains bit-identical numerical accuracy.

media Hugging Face Forums · 3d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a full pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.

media r/LocalLLaMA · 3d ago

QAT KV Cache Quantization for Gemma 4 31B Shows Massive Improvement

QAT KV cache quantization for Gemma 4 31B significantly reduces KL divergence compared to standard quants. QAT q8_0 achieves a worst-case KL divergence of 1.5, outperforming standard q4_0 by a factor of about 38, and QAT q4_0 surpasses standard q8_0, with much lower output drift and no catastrophic outliers.
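For readers unfamiliar with the metric, the Python sketch below shows one common way to compute per-token KL divergence between a reference run and a quantized run, and to take the worst case over positions. The logits are made-up values and the post's exact methodology is not stated in the summary, so this is only an assumed illustration of the measurement.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits for two token positions: reference KV cache vs quantized KV cache.
reference = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.2]]
quantized = [[1.9, 1.1, 0.1], [1.5, 1.6, 0.3]]

per_token = [
    kl_divergence(softmax(ref), softmax(quant))
    for ref, quant in zip(reference, quantized)
]
worst_case = max(per_token)                  # the outlier a worst-case figure reports
mean_kld = sum(per_token) / len(per_token)   # the average drift
print(per_token, worst_case, mean_kld)
```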

media r/LocalLLaMA · 3d ago

Gemma 4 QAT 31B responds better to KV cache quantization

A benchmark shared by user justicecurcian on the LocalLLaMA subreddit shows that Gemma 4 QAT 31B performs better with KV cache quantization than previous versions.

media r/LocalLLaMA · 3d ago

Local LLM Inference Optimization: The Complete Guide

A comprehensive guide to optimizing local LLM inference covers VRAM management, KV cache, MoE placement, MTP, CPU tuning, and common out-of-memory issues. The guide is available at https://carteakey.dev/blog/local-inference/local-llm-optimization/, and the author is asking for feedback.

media r/LocalLLaMA · 4d ago

I forked ik_llama.cpp and added --numa mirror mode

A new fork of ik_llama.cpp adds a --numa mirror mode that duplicates model weights and KV cache across CPU sockets, enabling full utilization of multi-socket systems. This reduces remote memory access penalties and improves inference throughput by up to 1.6x on tested models, though it requires twice the RAM.

media r/LocalLLaMA · 4d ago

2× Radeon R9700 with Qwen 3.6 27B Q8 MTP on llama.cpp

A user reports running the Qwen 3.6 27B MTP model on two Radeon R9700 GPUs via llama.cpp with ROCm 7.2.1. Tests show stable decode speeds (40–67 t/s) and prefill throughput of up to 1,500 t/s for prompts under 10k tokens, with MTP draft acceptance rates between 0.33 and 0.61.
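The Python sketch below gives a rough sense of what such acceptance rates could mean for throughput. It uses the standard speculative-decoding estimate and assumes each drafted token is accepted independently with the same probability; whether the reported rates are per-token figures, and what draft length was used, is not stated, so `draft_len` here is a hypothetical parameter.

```python
def expected_tokens_per_step(acceptance, draft_len):
    """Expected tokens emitted per verification pass of the main model.

    Assumes independent, identical per-token acceptance, and that the main
    model always contributes one token itself (geometric-series estimate).
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    if acceptance == 1.0:
        return float(draft_len + 1)
    return (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)


# Acceptance rates from the summary; draft lengths are assumptions.
for draft_len in (1, 2, 3):
    low = expected_tokens_per_step(0.33, draft_len)
    high = expected_tokens_per_step(0.61, draft_len)
    print(draft_len, round(low, 2), round(high, 2))
```

This counts tokens per pass only; it ignores the cost of producing the drafts, so it is an upper bound on the speedup rather than a prediction.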

media r/LocalLLaMA · 4d ago

ROCm vs Vulkan vs vLLM Performance on Dual R9700s

Tests show vLLM achieves significantly higher generation speeds on Qwen3.6 models, with 35B-A3B reaching 156 t/s using ROCm and AITER. ROCm outperforms Vulkan on both the 35B and 27B models, at ~106 t/s and ~44 t/s respectively, compared with ~87 t/s and ~41 t/s for Vulkan.

media r/LocalLLaMA · 4d ago

Why is AutoRound being slept on so hard?

AutoRound significantly outperforms standard AWQ and RTN in perplexity and accuracy, especially for complex reasoning and long contexts. It natively exports to GGUF, bypassing conversion issues, and runs on any PyTorch setup, yet remains underused despite these advantages.

media r/LocalLLaMA · 4d ago

Gemma 4 QAT responds better to KV cache quantization

A Reddit post reports that Gemma 4 QAT shows a significant improvement when using KV cache quantization, as measured on the wikitext dataset with 16k context. The user notes that their hardware limits testing of 31B models and invites others to explore the results.