Inference efficiency — korshunov.ai

How to Set Optimal llama.cpp Parameters for AMD GPU

Users seeking optimal llama.cpp settings for Gemma 4 models on an AMD GPU with 16GB VRAM ask whether trial and error is necessary. They reference Google's default settings for temperature, top-p, and top-k but report inconsistent results, which points to a need for more targeted guidance than the official documentation provides.
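To make the three sampler settings concrete, the following is a minimal Python sketch of how temperature, top-k, and top-p are commonly combined. It is not llama.cpp's implementation; the function name `filter_candidates` and the example values are hypothetical and are not Google's recommended defaults.

```python
import math

def filter_candidates(logits, temperature, top_k, top_p):
    """Return (token_index, probability) pairs that survive top-k, then top-p."""
    # Temperature rescales logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return [(index, prob / norm) for index, prob in kept]

# Hypothetical values chosen only to show the effect of each knob.
survivors = filter_candidates([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.8)
```

Tightening any one of the three shrinks the candidate set, which is one reason the same defaults can behave differently across models and quantizations.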

media r/LocalLLaMA · 5d ago

Fixing Long-Context Decode Cliff on Radeon R9700 with vLLM 0.22.1

A long-context decode performance cliff on AMD Radeon AI PRO R9700 (RDNA4) was resolved by enabling AITER Unified Attention in vLLM 0.22.1. The fix involves relaxing a CDNA gate to include RDNA4, disabling other attention backends, and using bf16 KV cache, resulting in significant speedups across all context lengths. FP8 KV is ineffective on this hardware, and the model's native 262K context is fully achievable with bf16, offering ~2.9× concurrency without needing FP8.

github llama.cpp · 5d ago

llama.cpp Release b9728 Adds Comment Line Support and Multiple Platform Binaries

llama.cpp version b9728 introduces support for comment lines in the file passed to --api-key-file. The release includes pre-built binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.
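As an illustration of what comment-line support in a key file means, here is a small Python sketch. It is not llama.cpp's parser: the `#` comment marker, the one-key-per-line layout, and the function name `load_api_keys` are all assumptions, since the release summary does not describe the syntax.

```python
def load_api_keys(text, comment_prefix="#"):
    """Collect one key per line, skipping blank lines and comment lines.

    The "#" prefix is an assumed comment marker, not a documented one.
    """
    keys = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue
        keys.append(stripped)
    return keys

# Hypothetical file contents.
example_file = "# staging\nkey-one\n\n  # production\nkey-two\n"
keys = load_api_keys(example_file)
```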

media r/LocalLLaMA · 5d ago

Best Harness for Web Searching

Users report that tools like LM Studio and Odysseus are limited by search engine request caps, often 10 requests per day or per hour, when no API access is configured. They suggest creating DuckDuckGo API accounts for better search access, but note that frontends rarely prompt for this. The post asks whether Hermes or Pi offer better solutions.

media r/LocalLLaMA · 5d ago

Is My CPU and RAM Too Weak for Local LLMs?

A user reports that their CPU and RAM reach 100% during simple test prompts while the GPU is underutilized. They ask whether their RTX 3050 8GB GPU can run Qwen3.5:9b locally, noting that in theory it should be feasible.

github llama.cpp · 5d ago

llama.cpp release b9718: consolidated slot selection and new binary builds

llama.cpp version b9718 consolidates slot selection into a single function, get_available_slot, while maintaining LCP similarity checks for prompt cache updates. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options.
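The idea of choosing a slot by longest-common-prefix (LCP) similarity can be sketched in Python as below. This is only an illustration under stated assumptions, not the code behind get_available_slot: the similarity formula (shared prefix length divided by cached length), the `min_similarity` threshold, the least-recently-used fallback, and the dictionary fields are all hypothetical.

```python
def common_prefix_length(a, b):
    """Number of leading tokens that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_slot(slots, prompt_tokens, min_similarity=0.5):
    """Return the id of an idle slot, preferring one whose cache matches the prompt.

    slots: list of dicts with 'id', 'busy', 'cached_tokens', 'last_used'
    (a hypothetical layout for this sketch).
    """
    idle = [s for s in slots if not s["busy"]]
    if not idle:
        return None

    best, best_score = None, 0.0
    for slot in idle:
        cached = slot["cached_tokens"]
        if not cached:
            continue
        # Assumed similarity: fraction of the cached tokens reused by the new prompt.
        score = common_prefix_length(cached, prompt_tokens) / len(cached)
        if score >= min_similarity and score > best_score:
            best, best_score = slot, score

    if best is not None:
        return best["id"]
    # No cache is similar enough: fall back to the least recently used idle slot.
    return min(idle, key=lambda s: s["last_used"])["id"]

slots = [
    {"id": 0, "busy": False, "cached_tokens": [1, 2, 3, 4], "last_used": 10},
    {"id": 1, "busy": False, "cached_tokens": [9, 9], "last_used": 5},
    {"id": 2, "busy": True, "cached_tokens": [1, 2, 3, 4], "last_used": 1},
]
chosen = pick_slot(slots, [1, 2, 3, 7])
```

Reusing the slot with the longest shared prefix means fewer prompt tokens have to be re-evaluated.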

github llama.cpp · 5d ago

llama.cpp Release b9721 Available for Multiple Platforms

llama.cpp has released version b9721, offering binaries for macOS, Linux, Android, Windows, and openEuler across various architectures. The release includes CPU, Vulkan, ROCm, OpenVINO, SYCL, and HIP support, with a dedicated UI package. A feature for Apple Silicon with KleidiAI is currently disabled.

media r/LocalLLaMA · 5d ago

GLM-5.2 can now run locally in llama.cpp and Unsloth Studio

GLM-5.2, the strongest open model to date, can now run locally using llama.cpp and Unsloth Studio. The 2-bit quantized model retains ~82% accuracy after its size is reduced from 1.51TB to 238GB, an 84% reduction, and it fits on setups with 256GB of RAM or VRAM.
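The size figures can be checked with a few lines of Python. The sketch assumes decimal units (1 TB = 1000 GB), which the summary does not state, and the function name is hypothetical.

```python
def size_reduction_percent(original_gb, quantized_gb):
    """Percentage by which the quantized model is smaller than the original."""
    return 100.0 * (1.0 - quantized_gb / original_gb)

original_gb = 1.51 * 1000   # assumes 1 TB = 1000 GB
quantized_gb = 238
reduction = size_reduction_percent(original_gb, quantized_gb)

# Memory left over on a 256 GB machine after loading the weights.
headroom_gb = 256 - quantized_gb

print(f"{reduction:.1f}% smaller, {headroom_gb} GB headroom")
```

Under that assumption the reduction rounds to the quoted 84%, and a 256GB machine has about 18GB left for everything other than the weights.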

github llama.cpp · 6d ago

llama.cpp Release b9715 Adds CUDA Col2Im 1D and Multiple Platform Binaries

llama.cpp version b9715 introduces CUDA support for GGML_OP_COL2IM_1D, building on a CPU implementation. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and acceleration frameworks, including Vulkan, ROCm, OpenVINO, and SYCL.

arxiv arXiv cs.AI · 6d ago

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant enables 4-bit KV caching for context-heavy agents, reducing P50 time-to-first-token by 3.47x in late rounds and boosting output throughput by 1.63x over FP8 KV baseline. It achieves this using FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA on AMD CDNA4 GPUs, with optimizations for decode-attention kernels and robust design choices like asymmetric K/V treatment and Walsh-Hadamard rotation.

arxiv arXiv cs.LG · 6d ago

HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction

HEPTv2 achieves 98.6% tracking efficiency with 0.8% fake rate on TrackML, using only 15 ms inference time and 0.4 GB memory per event. It outperforms prior transformer and graph-based methods in efficiency and reduces latency by factors of 7 and 38–52, respectively, enabling real-time particle reconstruction at the HL-LHC.

arxiv arXiv cs.LG · 6d ago

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant introduces a 4-bit KV caching method tailored for context-heavy agent workloads. It achieves 3.47x reduction in P50 time-to-first-token in late rounds and 1.63x higher output throughput compared to FP8 KV caching, using FP8 queries, FP4 KV tensors, and native AMD CDNA4 scaled-MFMA support.

arxiv arXiv cs.LG · 6d ago

Execution-State Capsules for Low-Latency On-Device AI Serving

Execution-state capsules enable graph-bound checkpointing and restoration of complete execution state, including KV, recurrent, and convolution states, for low-latency, small-batch on-device AI serving. On RTX 5090 and Jetson AGX Thor, capsule restore achieves byte-exact and token-identical correctness, with sub-millisecond GPU operations and TTFT speedups up to 27x at 16k tokens, demonstrating significant latency reduction in interactive AI workflows.

arxiv arXiv cs.LG · 6d ago

StreamKL: Fast and Memory-Efficient KL Divergence for Attention Distillation

StreamKL introduces a fused GPU primitive that eliminates quadratic memory usage in attention distillation by streaming query-key tiles through on-chip SRAM. It achieves up to 43x speedup in the forward pass and 14x in the backward pass, reducing the extra HBM footprint from O(N_Q N_K) to O(1) and enabling long-context distillation on a single GPU.
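The streaming idea can be illustrated on the CPU with plain Python. This sketch is not the paper's fused kernel; it only shows how the KL divergence between two softmax rows can be accumulated one key tile at a time with a handful of running scalars, using the online-softmax rescaling trick. The direction KL(teacher || student), the tile size, and the function name are assumptions.

```python
import math

def streaming_kl_row(teacher_scores, student_scores, tile=4):
    """KL(softmax(teacher) || softmax(student)) for one query row, tile by tile."""
    m_t = m_s = -math.inf   # running maxima of teacher / student scores
    z_t = z_s = 0.0         # running sums of exp(score - running max)
    w = 0.0                 # running sum of exp(t - m_t) * (t - s)

    for start in range(0, len(teacher_scores), tile):
        t_tile = teacher_scores[start:start + tile]
        s_tile = student_scores[start:start + tile]
        new_m_t = max(m_t, max(t_tile))
        new_m_s = max(m_s, max(s_tile))

        # Rescale what has been accumulated so far to the new maxima.
        scale_t = math.exp(m_t - new_m_t)
        scale_s = math.exp(m_s - new_m_s)
        z_t *= scale_t
        w *= scale_t
        z_s *= scale_s

        for t, s in zip(t_tile, s_tile):
            e = math.exp(t - new_m_t)
            z_t += e
            w += e * (t - s)
            z_s += math.exp(s - new_m_s)
        m_t, m_s = new_m_t, new_m_s

    # KL = E_p[t - s] - logsumexp(t) + logsumexp(s)
    log_z_t = m_t + math.log(z_t)
    log_z_s = m_s + math.log(z_s)
    return w / z_t - log_z_t + log_z_s

teacher = [0.3, -1.2, 2.5, 0.0, 1.1, -0.7, 0.9, 3.0, -2.0, 0.4]
student = [0.1, -0.8, 2.0, 0.5, 1.0, -1.0, 1.2, 2.2, -1.5, 0.0]
kl = streaming_kl_row(teacher, student)
```

Only five scalars are carried between tiles, so the full row of probabilities is never stored. That is the property the O(1) extra-memory claim refers to, here shown for a single row.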

arxiv arXiv cs.LG · 6d ago

AD-DeepONet for Fast Bridge Response Prediction

An adaptive-trunk DeepONet framework predicts localized structural responses in long-span bridges with high accuracy. Using distance-aware features and a stiffness-informed Schur complement, it stays within 5% error of finite element method (FEM) results while cutting total response evaluation time by 60x and speeding up inference by up to four orders of magnitude compared with FEM.

arxiv arXiv cs.CL · 6d ago

Selective Verification for Budget-Aware Reasoning

Sevra, a serving-layer controller, selectively verifies answers to improve accuracy and reduce token usage. On MATH-500, it achieves 76.3% accuracy with 26.8% fewer post-generation tokens and half as many harmful flips, while on GSM8K it verifies only 3.0% of examples, raising accuracy to 94.5% and cutting verification tokens by 91.2%. The study shows that initial solve length and the need for explicit control determine the optimal verification strategy.

arxiv arXiv cs.CL · 6d ago

Token-Operations-Oriented Inference Optimization Techniques

This paper introduces a four-layer technical architecture for token-oriented inference optimization, including Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It reviews key technologies and industry status, analyzing their real-world application value in reducing token costs, enhancing service efficiency, and ensuring stable token supply.

media r/LocalLLaMA · 6d ago

GLM-5.2 (744B, 2-bit) achieves 7.3 tok/s on 4×3090 with 192GB RAM

GLM-5.2 UD-IQ2_M runs at ~7.3 tokens per second on 4×RTX 3090s with 192GB DDR5 RAM using llama.cpp expert offload. Reducing quantization from IQ2 to IQ1 provided no speed gain, while increasing CPU threads from 6 to 12 improved performance by 22%. Decode is limited by CPU compute, not memory bandwidth, and the offloaded experts must be explicitly distributed across GPUs to avoid out-of-memory errors.

media r/LocalLLaMA · 6d ago

DiffusionGemma 26B on 4090 reaches 475 t/s with limitations

DiffusionGemma 26B runs at up to 475 t/s on a 4090 via vLLM with INT4 AWQ quantization, with reported speeds ranging from 290 t/s to 700 t/s depending on output length. However, it is limited to single-user operation and shows lower response accuracy, rapid context loss, and a slower time-to-first-token than standard 26B models.

media r/LocalLLaMA · 6d ago

Running GLM-5.2 on CPU Only with Local Setup

A user runs GLM-5.2 locally on a Dell PowerEdge R740 with dual Xeon 6248R CPUs and 768GB RAM, using ik_llama.cpp for improved CPU inference. After isolating one NUMA node for optimal performance, they achieve 4–5.5 tokens per second in chat and about 3 tokens per second in coding tasks, noting the model shows 'frontier vibes' during code generation despite limited usability on this hardware.