Inference efficiency — korshunov.ai

Idea for running GLM2 at decent quant with GPU and DDR3 setup

A user proposes running GLM2 at a reasonable quantization level on four 5060 Ti GPUs (64GB VRAM total) connected over PCIe Gen 3. The plan adds 512GB of DDR3 RAM in a server with 16 PCIe lanes and 4x4 bifurcation to offload KV cache storage, aiming for efficient inference without relying on unified memory clusters. The setup is estimated to cost around $1700 total.
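
As a rough sanity check on the interconnect in this proposal, the sketch below works out the per-GPU link bandwidth. It assumes the standard PCIe Gen 3 figures (8 GT/s per lane, 128b/130b encoding) and that the 16 lanes are split evenly across the four cards. The 1 GB transfer is a hypothetical size chosen for illustration, not a number from the post, and real throughput will be lower than this theoretical ceiling.

```python
# Theoretical PCIe Gen 3 bandwidth for the proposed 4-GPU build.
# Assumption: 8 GT/s per lane with 128b/130b line encoding (PCIe Gen 3).
GT_PER_S = 8.0
ENCODING_EFFICIENCY = 128 / 130


def pcie3_gb_per_s(lanes: int) -> float:
    """One-direction theoretical bandwidth in GB/s for a Gen 3 link."""
    return GT_PER_S * ENCODING_EFFICIENCY / 8 * lanes


def transfer_seconds(gigabytes: float, lanes: int) -> float:
    """Best-case time to move `gigabytes` of data over the link."""
    return gigabytes / pcie3_gb_per_s(lanes)


gpus = 4
vram_per_gpu_gb = 64 / gpus   # 64 GB total across four cards (from the post)
lanes_per_gpu = 16 // gpus    # 16 lanes with 4x4 bifurcation

print(f"VRAM per GPU: {vram_per_gpu_gb:.0f} GB")
print(f"Per-GPU link: {pcie3_gb_per_s(lanes_per_gpu):.2f} GB/s")
# Hypothetical example: moving 1 GB of offloaded KV cache to one GPU.
print(f"1 GB transfer: {transfer_seconds(1.0, lanes_per_gpu):.2f} s")
```

Each card ends up with about 3.9 GB/s to host memory, so how often offloaded KV cache has to cross the bus is likely to matter more than the raw DDR3 capacity.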

lab NVIDIA Technical Blog · 4d ago

CCCL Runtime: A Modern C++ Runtime for CUDA

NVIDIA has released the CCCL Runtime, a modern C++ runtime that provides safer and more convenient abstractions for CUDA programming. It introduces updated C++ features to simplify and enhance CUDA C++ development.

lab NVIDIA Technical Blog · 4d ago

Enable Real-Time AI for High-Speed Data Acquisition with DAQIRI

AlphaFold2's 2020 success relied on 170,000 protein structures from the Protein Data Bank. NVIDIA's DAQIRI enables real-time AI processing for high-speed data acquisition by analyzing data as it is generated.

media r/LocalLLaMA · 4d ago

GLM-5.2 UD-IQ1_M Speed Test on llama.cpp with 5090 and 3090 Ti

A speed test of GLM-5.2 quantized to UD-IQ1_M using llama.cpp shows 579 t/s prefill at 8k context and 324 t/s at 57k context. Decode speed remains steady at 10.6 t/s for over 580 tokens, dropping to 9.37 t/s at 60k context.

media r/LocalLLaMA · 4d ago

Qwen3.6-35B-A3B APEX on RTX 3090: Speed and Quality Benchmarks

A benchmark compares two llama.cpp forks (ik_llama and spiritbuun) running Qwen3.6-35B-A3B APEX with the I-Compact and I-Quality models. ik_llama with I-Compact achieves the highest speed (~146 TPS), while spiritbuun with I-Quality and a turbo8/turbo4 cache matches this speed and offers slightly better HellaSwag performance. The turbo8/turbo4 KV caches outperform q8_0/q5_0, especially at longer contexts, with up to a 15% speed gain and lower KLD.

media MarkTechPost · 4d ago

MoonMath AI Open-Sources HIP Attention Kernel That Beats AITER v3 on MI300X

MoonMath AI has open-sourced a bf16 forward attention kernel for AMD's MI300X GPU, written in HIP rather than assembly. It outperforms AMD's own AITER v3 kernel across all tested shapes and rounding modes, with speedups up to 1.26x, and maintains bit-identical numerical accuracy.

media Hugging Face Forums · 4d ago

I built a novel triple-hybrid LLM under 1B parameters for ~$50

Mateusz has developed a full pre-trained language model, Project Inkblot's Titan v1, combining Mamba SSM, Multi-Head Attention, and 32-expert MoE in a single decoder-only architecture under 1B parameters. The model, trained on a single NVIDIA L4 GPU for ~$50, achieves 27.5 validation perplexity and demonstrates efficient scaling via a single-line config update, with all components implemented from scratch in PyTorch. Titan v2's first training cycle is now complete, and dataset expansion is underway.

media r/LocalLLaMA · 4d ago

QAT KV Cache Quantization for Gemma 4 31B Shows Massive Improvement

QAT KV cache quantization for Gemma 4 31B significantly reduces KL divergence compared to standard quants. QAT q8_0 achieves a worst-case divergence of 1.5, outperforming standard q4_0 by a factor of about 38, and QAT q4_0 surpasses standard q8_0 in performance, with much lower output drift and no catastrophic outliers.
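
For readers unfamiliar with the metric, here is a minimal sketch of how KL divergence between a reference model's next-token distribution and a quantized run's distribution can be computed. The logit values are made up for illustration, and this is not the benchmark's actual measurement code, which the post does not show.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is clipped at eps to avoid log(0)."""
    return sum(
        pi * (math.log(pi) - math.log(max(qi, eps)))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical logits over a 4-token vocabulary.
reference_logits = [2.0, 1.0, 0.5, -1.0]   # e.g. unquantized KV cache
quantized_logits = [1.8, 1.1, 0.4, -0.7]   # e.g. quantized KV cache

p = softmax(reference_logits)
q = softmax(quantized_logits)
print(f"KL(reference || quantized) = {kl_divergence(p, q):.6f} nats")
```

A value of zero means the two distributions are identical. Larger values mean the quantized run's predictions drift further from the reference.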

media r/LocalLLaMA · 4d ago

Gemma 4 QAT 31B responds better to KV cache quantization

A benchmark shows that Gemma 4 QAT 31B performs better with KV cache quantization than previous versions. The results come from a post on the LocalLLaMA subreddit by user justicecurcian.

media r/LocalLLaMA · 4d ago

Local LLM Inference Optimization: The Complete Guide

A comprehensive guide to optimizing local LLM inference covers VRAM management, KV cache, MoE placement, MTP, CPU tuning, and common out-of-memory issues. The guide is available at https://carteakey.dev/blog/local-inference/local-llm-optimization/ and includes feedback requests from the author.

media r/LocalLLaMA · 5d ago

I forked ik_llama.cpp and added --numa mirror mode

A new fork of ik_llama.cpp adds a --numa mirror mode that duplicates model weights and KV cache across CPU sockets, enabling full utilization of multi-socket systems. This reduces remote memory access penalties and improves inference throughput by up to 1.6x on tested models, though it requires twice the RAM.

media r/LocalLLaMA · 5d ago

2× Radeon R9700 with Qwen 3.6 27B Q8 MTP on llama.cpp

A user reports running the Qwen 3.6 27B MTP model on two Radeon R9700 GPUs via llama.cpp with ROCm 7.2.1. Tests show stable decode speeds (40–67 t/s) and prefill throughput of up to 1,500 t/s for prompts under 10k tokens, with MTP draft acceptance rates between 0.33 and 0.61.
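
To put the reported acceptance rates in context, here is a rough sketch. If each drafted token were accepted independently with probability a, and k tokens were drafted per step, the expected number of tokens emitted per verification step would be (1 - a^(k+1)) / (1 - a). Both the independence assumption and the draft length k are assumptions for illustration. The post does not state them, and llama.cpp may compute its acceptance rate differently.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each drafted token is accepted independently with
    probability `accept_rate` (a simplifying assumption).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Acceptance rates from the report; draft lengths are hypothetical.
for rate in (0.33, 0.61):
    for draft_len in (1, 2, 3):
        tokens = expected_tokens_per_step(rate, draft_len)
        print(f"accept={rate:.2f} draft_len={draft_len}: {tokens:.2f} tokens/step")
```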

media r/LocalLLaMA · 5d ago

ROCm vs Vulkan vs vLLM Performance on Dual R9700s

Tests show vLLM achieves significantly higher generation speeds on Qwen3.6 models, with 35B-A3B reaching 156 t/s using ROCm and AITER. ROCm outperforms Vulkan in both 35B and 27B models, with speeds of ~106 t/s and ~44 t/s respectively, while Vulkan achieves ~87 t/s and ~41 t/s.
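
The ROCm versus Vulkan gap is easier to compare as a relative gain. The short sketch below uses only the approximate figures quoted above and assumes they come from comparable runs.

```python
# Generation speeds in tokens/s, as quoted in the summary (approximate).
speeds = {
    "35B-A3B": {"rocm": 106.0, "vulkan": 87.0},
    "27B": {"rocm": 44.0, "vulkan": 41.0},
}


def relative_gain(fast: float, slow: float) -> float:
    """Fractional speedup of `fast` over `slow` (0.10 means 10% faster)."""
    return fast / slow - 1.0


for model, s in speeds.items():
    gain = relative_gain(s["rocm"], s["vulkan"])
    print(f"{model}: ROCm is {gain:.1%} faster than Vulkan")
```

On these figures ROCm leads by roughly 22% on the 35B-A3B model but only about 7% on the 27B model.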

media r/LocalLLaMA · 5d ago

Why is AutoRound being slept on so hard?

AutoRound significantly outperforms standard AWQ and RTN in perplexity and accuracy, especially for complex reasoning and long contexts. It natively exports to GGUF, bypassing conversion issues, and runs on any PyTorch setup, yet remains underused despite these advantages.

media r/LocalLLaMA · 5d ago

Gemma 4 QAT responds better to KV cache quantization

A Reddit post reports that Gemma 4 QAT shows significant improvement in performance when using KV cache quantization, as measured on the wikitext dataset with 16k context. The user notes their hardware limits testing 31B models and invites others to explore the results.

media r/LocalLLaMA · 6d ago

GLM 5.2 Local Inference Speeds Report

Users report local GLM 5.2 inference speeds with llama.cpp on 6x RTX 3090, 128GB DDR5, and an i7-13700K: 7.8 tokens/sec at 90K context with Q8_0 quantization. Prompt processing runs at approximately 40 tokens/sec.
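
A quick back-of-the-envelope sketch shows what these rates mean in wall-clock time. It assumes "90K" means 90,000 tokens, that the whole context is processed as a prompt at a constant 40 tokens/sec, and a hypothetical 500-token reply. None of these assumptions are stated in the report.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a constant rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


PROMPT_TPS = 40.0   # prompt processing rate from the report
DECODE_TPS = 7.8    # generation rate at 90K context from the report

prompt_tokens = 90_000   # assumption: full 90K context processed as prompt
reply_tokens = 500       # hypothetical reply length

prefill_s = seconds_for(prompt_tokens, PROMPT_TPS)
decode_s = seconds_for(reply_tokens, DECODE_TPS)

print(f"Prefill: {prefill_s / 60:.1f} min")
print(f"Decode:  {decode_s:.1f} s")
```

Under these assumptions, prompt processing (about 37.5 minutes) takes far longer than generation (about a minute).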

github llama.cpp · 6d ago

llama.cpp Release b9741 Adds New Binaries and Support

llama.cpp version b9741 introduces new binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes support for Vulkan, CUDA 12.4 and 13.3, OpenVINO, SYCL, and ROCm, with updated versions for iOS and Ubuntu.

media r/LocalLLaMA · 6d ago

Free 15-Part Series on LLM Internals Grounded in Gemma 4 12B

The author has written a free 15-part series on LLM internals, using Gemma 4 12B as the core example. Each part covers technical aspects from tokenization to serving, with real math, tensor shapes, and hardware constraints. The series includes a companion vLLM Deep Dive and is fully accessible without paywalls or email sign-up.

github llama.cpp · 6d ago

Fix for test-args-parser random failures on Windows

A patch addresses random failures in the test-args-parser on Windows by modifying argv override to only apply when argc matches, preventing clobbering of programmatic arguments. This fixes a fastfail assertion in the OpenVINO Windows workflow while preserving UTF-8 handling for real binaries.

media r/LocalLLaMA · 6d ago

You can now convert EXL3 quants on Apple Silicon Mac

Users can now convert and run EXL3 quantized models on Apple Silicon Macs with 64GB+ RAM. Tests show that models like MiniCPM5 and Qwen3.6-27B achieve performance on par with or slightly behind RTX-card-based conversions, with EXL3 offering superior quantization quality compared to MLX.