Inference efficiency — korshunov.ai


llama.cpp release b9718: consolidated slot selection and new binary builds

llama.cpp version b9718 consolidates slot selection into a single function, get_available_slot, while maintaining LCP similarity checks for prompt cache updates. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options.
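As a rough illustration of slot selection driven by longest-common-prefix (LCP) similarity, the sketch below picks an idle slot whose cached tokens share the longest prefix with the incoming prompt and falls back to the least recently used idle slot. It is not the llama.cpp implementation: `Slot`, `pick_slot`, the similarity definition (prefix length divided by prompt length), and the `min_similarity` threshold are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slot:
    # Hypothetical slot record; field names are illustrative only.
    id: int
    cached_tokens: List[int] = field(default_factory=list)
    busy: bool = False
    last_used: int = 0  # logical timestamp, smaller means older


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Number of leading tokens shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pick_slot(slots: List[Slot], prompt: List[int],
              min_similarity: float = 0.5) -> Optional[Slot]:
    """Prefer the idle slot with the best prefix match, else the oldest idle slot."""
    idle = [s for s in slots if not s.busy]
    if not idle:
        return None
    best, best_sim = None, 0.0
    for s in idle:
        # Assumed similarity: shared prefix as a fraction of the prompt length.
        sim = common_prefix_len(s.cached_tokens, prompt) / max(len(prompt), 1)
        if sim >= min_similarity and sim > best_sim:
            best, best_sim = s, sim
    if best is not None:
        return best
    # No sufficiently similar cache: reuse the least recently used idle slot.
    return min(idle, key=lambda s: s.last_used)


slots = [
    Slot(0, [1, 2, 3, 4], last_used=5),
    Slot(1, [9, 9], last_used=1),
    Slot(2, [1, 2, 3, 4, 5], busy=True),
]
chosen = pick_slot(slots, [1, 2, 3, 7])
```

Keeping both the similarity check and the fallback in one function means every caller sees the same policy, which is the kind of benefit a consolidation like this usually aims for.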

github llama.cpp · 6d ago

llama.cpp Release b9721 Available for Multiple Platforms

llama.cpp has released version b9721, with binaries for macOS, Linux, Android, Windows, and openEuler across various architectures. Builds cover CPU, Vulkan, ROCm, OpenVINO, SYCL, and HIP backends, plus a dedicated UI package. The Apple Silicon build with KleidiAI is currently disabled.

media r/LocalLLaMA · 6d ago

GLM-5.2 can now run locally in llama.cpp and Unsloth Studio

GLM-5.2, the strongest open model to date, can now run locally using llama.cpp and Unsloth Studio. The 2-bit quantized model retains ~82% accuracy while shrinking from 1.51TB to 238GB, an 84% reduction, and fits in setups with 256GB of RAM or VRAM.
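The size figures above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 1000 GB) and treats the 256GB figure as a hard memory budget. The helper name `size_reduction` is made up for the example.

```python
def size_reduction(original_gb: float, quantized_gb: float) -> float:
    """Fraction of the original size removed by quantization."""
    return 1.0 - quantized_gb / original_gb


full_gb = 1.51 * 1000   # 1.51 TB, assuming decimal units
quant_gb = 238          # 2-bit quantized size from the summary
budget_gb = 256         # RAM or VRAM budget from the summary

reduction = size_reduction(full_gb, quant_gb)
headroom_gb = budget_gb - quant_gb  # left over for KV cache and runtime overhead

print(f"reduction: {reduction:.1%}")      # reduction: 84.2%
print(f"headroom:  {headroom_gb} GB")     # headroom:  18 GB
```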

github llama.cpp · 7d ago

llama.cpp Release b9715 Adds CUDA Col2Im 1D and Multiple Platform Binaries

llama.cpp version b9715 introduces CUDA support for GGML_OP_COL2IM_1D, building on the existing CPU implementation. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and acceleration frameworks, including Vulkan, ROCm, OpenVINO, and SYCL.

arxiv arXiv cs.AI · 7d ago

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant enables 4-bit KV caching for context-heavy agents, reducing P50 time-to-first-token by 3.47x in late rounds and boosting output throughput by 1.63x over FP8 KV baseline. It achieves this using FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA on AMD CDNA4 GPUs, with optimizations for decode-attention kernels and robust design choices like asymmetric K/V treatment and Walsh-Hadamard rotation.
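To make "FP4 KV tensors with UE8M0 group scales" more concrete, here is a minimal CPU sketch of group quantization onto the FP4 (E2M1) magnitude grid with a power-of-two scale per group, which is what an exponent-only 8-bit scale can represent. It is not the paper's kernel. The scale-selection rule (smallest power of two that keeps the group within the grid) and the function names are assumptions, and the rotation and asymmetric K/V handling mentioned above are not shown.

```python
import math
from typing import List, Tuple

# Magnitudes representable by FP4 E2M1 (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_group(values: List[float]) -> Tuple[int, List[float]]:
    """Return (scale exponent, FP4 codes) for one group of values."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    # Assumed rule: smallest power-of-two scale so that amax / scale <= 6.
    exp = math.ceil(math.log2(amax / FP4_GRID[-1]))
    scale = 2.0 ** exp
    codes = []
    for v in values:
        # Round the scaled magnitude to the nearest representable FP4 value.
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(mag, v))
    return exp, codes


def dequantize_group(exp: int, codes: List[float]) -> List[float]:
    """Reconstruct approximate values from the shared exponent and FP4 codes."""
    scale = 2.0 ** exp
    return [c * scale for c in codes]


exp, codes = quantize_group([0.1, -0.4, 1.2, -3.0])
restored = dequantize_group(exp, codes)
```

With a hypothetical group size of 32, one 8-bit scale per group adds 0.25 bits per element, so the stored cost would be about 4.25 bits per value.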

arxiv arXiv cs.LG · 7d ago

HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction

HEPTv2 achieves 98.6% tracking efficiency with 0.8% fake rate on TrackML, using only 15 ms inference time and 0.4 GB memory per event. It outperforms prior transformer and graph-based methods in efficiency and reduces latency by factors of 7 and 38–52, respectively, enabling real-time particle reconstruction at the HL-LHC.

arxiv arXiv cs.LG · 7d ago

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant introduces a 4-bit KV caching method tailored for context-heavy agent workloads. It achieves 3.47x reduction in P50 time-to-first-token in late rounds and 1.63x higher output throughput compared to FP8 KV caching, using FP8 queries, FP4 KV tensors, and native AMD CDNA4 scaled-MFMA support.

arxiv arXiv cs.LG · 7d ago

Execution-State Capsules for Low-Latency On-Device AI Serving

Execution-state capsules enable graph-bound checkpointing and restoration of complete execution state, including KV, recurrent, and convolution states, for low-latency, small-batch on-device AI serving. On RTX 5090 and Jetson AGX Thor, capsule restore achieves byte-exact and token-identical correctness, with sub-millisecond GPU operations and TTFT speedups up to 27x at 16k tokens, demonstrating significant latency reduction in interactive AI workflows.

arxiv arXiv cs.LG · 7d ago

StreamKL: Fast and Memory-Efficient KL Divergence for Attention Distillation

StreamKL introduces a fused GPU primitive that eliminates quadratic memory usage in attention distillation by streaming query-key tiles through on-chip SRAM. It achieves up to 43x speedup in forward and 14x in backward passes, reducing extra HBM footprint from O(N_Q N_K) to O(1), enabling long-context distillation on a single GPU.
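The idea of computing a KL divergence without materializing full attention rows can be illustrated on the CPU with an online log-sum-exp. The sketch below handles a single row of teacher and student logits tile by tile, keeping only a few running scalars. It is an assumption-laden illustration of the streaming principle, not the StreamKL kernel. The function names and tile size are invented for the example.

```python
import math
from typing import List


def streaming_kl(teacher: List[float], student: List[float], tile: int = 4) -> float:
    """KL(softmax(teacher) || softmax(student)) using O(1) extra state."""
    m_t = m_s = -math.inf   # running maxima of the logits
    z_t = z_s = 0.0         # running sums of exp(logit - max)
    w = 0.0                 # running sum of exp(t - m_t) * (t - s)
    for start in range(0, len(teacher), tile):
        t_tile = teacher[start:start + tile]
        s_tile = student[start:start + tile]
        new_m_t = max(m_t, max(t_tile))
        new_m_s = max(m_s, max(s_tile))
        # Rescale accumulators to the new maxima (exp(-inf) is 0.0 on the first tile).
        r_t = math.exp(m_t - new_m_t)
        r_s = math.exp(m_s - new_m_s)
        z_t *= r_t
        w *= r_t
        z_s *= r_s
        for t, s in zip(t_tile, s_tile):
            e = math.exp(t - new_m_t)
            z_t += e
            w += e * (t - s)
            z_s += math.exp(s - new_m_s)
        m_t, m_s = new_m_t, new_m_s
    lse_t = m_t + math.log(z_t)
    lse_s = m_s + math.log(z_s)
    # KL = E_p[t - s] - logsumexp(t) + logsumexp(s)
    return w / z_t - lse_t + lse_s


def naive_kl(teacher: List[float], student: List[float]) -> float:
    """Reference version that materializes both probability vectors."""
    zt = sum(math.exp(x) for x in teacher)
    zs = sum(math.exp(x) for x in student)
    p = [math.exp(x) / zt for x in teacher]
    q = [math.exp(x) / zs for x in student]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


t_row = [0.3, -1.2, 2.5, 0.0, 1.1, -0.7, 0.9, 3.2, -2.0]
s_row = [0.1, -0.8, 2.0, 0.4, 1.5, -1.0, 0.2, 2.9, -1.5]
kl_stream = streaming_kl(t_row, s_row)
kl_naive = naive_kl(t_row, s_row)
```

The streaming and naive versions should agree to floating-point tolerance. The difference is that the streaming one never stores the probability vectors.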

arxiv arXiv cs.LG · 7d ago

AD-DeepONet for Fast Bridge Response Prediction

An adaptive-trunk DeepONet framework predicts localized structural responses in long-span bridges with high accuracy. By using distance-aware features and a stiffness-informed Schur complement, it achieves FEM-level accuracy with less than 5% error, cutting total response evaluation time by 60x and speeding up inference by up to four orders of magnitude compared to finite element methods.

arxiv arXiv cs.CL · 7d ago

Selective Verification for Budget-Aware Reasoning

Sevra, a serving-layer controller, selectively verifies answers to improve accuracy and reduce token usage. On its MATH benchmark, it achieves 76.3% accuracy with 26.8% fewer post-generation tokens and half as many harmful flips, while on its GSM benchmark it verifies only 3.0% of examples, boosting accuracy to 94.5% and cutting verification tokens by 91.2%. The study shows that initial solve length and explicit control needs determine the optimal verification strategy.

arxiv arXiv cs.CL · 7d ago

Token-Operations-Oriented Inference Optimization Techniques

This paper introduces a four-layer technical architecture for token-oriented inference optimization, including Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It reviews key technologies and industry status, analyzing their real-world application value in reducing token costs, enhancing service efficiency, and ensuring stable token supply.

media r/LocalLLaMA · 7d ago

GLM-5.2 (744B, 2-bit) achieves 7.3 tok/s on 4×3090 with 192GB RAM

GLM-5.2 UD-IQ2_M runs at ~7.3 tokens per second on 4×RTX 3090s with 192GB DDR5 RAM using llama.cpp expert offload. Reducing quantization from IQ2 to IQ1 provided no speed gain, while increasing CPU threads from 6 to 12 improved performance by 22%. Decode is limited by CPU compute, not memory bandwidth, and the offloaded experts must be explicitly distributed across GPUs to avoid out-of-memory errors.

media r/LocalLLaMA · 7d ago

DiffusionGemma 26B on 4090 reaches 475 t/s with limitations

DiffusionGemma 26B runs at up to 475 t/s on a 4090 via vLLM with INT4 AWQ quantization, with speeds ranging from 290 t/s to 700 t/s depending on output length. However, it is limited to single-user operation and shows lower response accuracy, rapid context loss, and slower time-to-first-token than standard 26B models.

media r/LocalLLaMA · 7d ago

Running GLM-5.2 on CPU Only with Local Setup

A user runs GLM-5.2 locally on a Dell PowerEdge R740 with dual Xeon 6248R CPUs and 768GB RAM, using ik_llama.cpp for improved CPU inference. After isolating one NUMA node for optimal performance, they achieve 4–5.5 tokens per second in chat and about 3 tokens per second in coding tasks, noting the model shows 'frontier vibes' during code generation despite limited usability on this hardware.

media r/LocalLLaMA · 7d ago

Repurposing an Old Multi-GPU Node for Local Inference

The node features 8 NVIDIA Quadro RTX 6000 GPUs with 192 GB VRAM and 512 GB RAM, enabling large-scale local AI model inference. Models like LLaMA-3 or Mistral with 8-13 billion parameters could run efficiently here, offering faster, private, and low-latency performance compared to single-GPU setups, making it worthwhile for internal use.

media r/LocalLLaMA · 7d ago

Calibrating 2-bit GGUFs for agentic coding tasks

2-bit quantized versions of Qwopus3.6-27B-Coder, calibrated on real agentic coding logs, achieve a 63% pass rate on SWE-rebench. The IQ2_M quant outperforms non-calibrated versions and rivals Q5_K_M in pass rate despite being half the size, with improved robustness to loops and faster decoding thanks to a bundled multi-token prediction (MTP) head.

media Latent Space · 7d ago

Why AI Scaling Is a Systems Problem, Not Just a GPU Race

The AI scaling debate overlooks that maximizing model FLOPs utilization (MFU) matters more than buying more GPUs. Frontier labs like xAI operate at sub-10% MFU, while historical models achieved 21% to 70% MFU, indicating systemic inefficiencies in scheduling, networking, and cluster management. Anjney Midha argues that AI infrastructure must evolve into efficient, aligned, and responsible systems, with 'output maxing' emerging as a new discipline for frontier AI.
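For readers unfamiliar with the metric, MFU is the ratio of the FLOPs a training run actually spends on the model to the theoretical peak of the hardware. A minimal sketch, using the common approximation of about 6 FLOPs per parameter per trained token for dense transformers (attention cost ignored), might look like this. The model size, throughput, and peak figures are hypothetical and not taken from the article.

```python
def mfu(params: float, tokens_per_second: float, peak_flops_per_second: float,
        flops_per_param_token: float = 6.0) -> float:
    """Model FLOPs utilization: achieved model FLOP/s over hardware peak FLOP/s."""
    achieved = flops_per_param_token * params * tokens_per_second
    return achieved / peak_flops_per_second


# Hypothetical cluster: 70B-parameter model, 1M tokens/s, 1 EFLOP/s aggregate peak.
example_mfu = mfu(params=70e9, tokens_per_second=1e6, peak_flops_per_second=1e18)
print(f"MFU: {example_mfu:.0%}")  # MFU: 42%
```

Under this definition, a sub-10% MFU means more than nine tenths of the purchased peak compute is lost to stalls, communication, or other overhead.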

media r/LocalLLaMA · 7d ago

LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M Released

LFM2.5-Embedding-350M is a dense bi-encoder that provides fast multilingual retrieval with one vector per document, achieving best-in-class accuracy for its size and inference speed comparable to smaller models. LFM2.5-ColBERT-350M is a late interaction retriever with best-in-class multilingual accuracy, enabling cross-lingual retrieval by storing one vector per token and supporting retrieval in multiple languages with high precision. Both models are designed as drop-in replacements for existing RAG pipelines.
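The difference between the two retrieval styles comes down to how a query-document score is computed. The sketch below contrasts a dense bi-encoder score (one vector per text, cosine similarity) with a ColBERT-style late-interaction score (one vector per token, summing each query token's best match). The vectors are toy values and the function names are illustrative. Neither is the LFM2.5 API.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def dense_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Bi-encoder: one vector per text, scored by cosine similarity."""
    return dot(normalize(query_vec), normalize(doc_vec))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction: each query token takes its best-matching document token."""
    q = [normalize(v) for v in query_tokens]
    d = [normalize(v) for v in doc_tokens]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


dense = dense_score([1.0, 0.0], [1.0, 0.0])
late = maxsim_score([[1.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

The trade-off is storage versus granularity: the dense model keeps one vector per document, while late interaction stores one per token and pays for it in index size.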

media r/LocalLLaMA · 7d ago

Real-world token cost savings from rtk, headroom, and caveman

A real workload analysis shows headroom, rtk, and caveman reduce token costs by 2.8%, 0.5%, and 0.4% respectively, totaling 3.7% of baseline spending. However, savings are limited by payload diversity, with most traffic being plain text or source code, and the tools only compress structured outputs. Most cost reduction occurs on the cheapest token stream—cache reads—while the tools do not affect prompt caching or output costs, and coverage gaps exist, especially for rtk.