Inference efficiency — korshunov.ai

Repurposing an Old Multi-GPU Node for Local Inference

The node has 8 NVIDIA Quadro RTX 6000 GPUs (192 GB of VRAM in total) and 512 GB of RAM, enough for large-scale local model inference. Models such as LLaMA-3 or Mistral in the 8-13 billion parameter range could run efficiently on it, offering private, low-latency inference and more headroom than a single-GPU setup, which makes the node worth repurposing for internal use.
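As a rough check on what fits, weights-only memory can be estimated from parameter count and bytes per parameter. The sketch below rests on simplifying assumptions: it ignores the KV cache, activations, and runtime overhead, and the helper name `weights_gb` is hypothetical. The 24 GB per-GPU figure follows from 192 GB spread across 8 cards.

```python
# Rough weights-only memory estimate; ignores KV cache, activations, and overhead.
GPU_COUNT = 8
VRAM_PER_GPU_GB = 192 / GPU_COUNT  # 24 GB per card, derived from the post's totals


def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


if __name__ == "__main__":
    for params in (8, 13):
        for label, nbytes in (("FP16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)):
            size = weights_gb(params, nbytes)
            verdict = "fits on one GPU" if size < VRAM_PER_GPU_GB else "needs more than one GPU"
            print(f"{params}B @ {label}: ~{size:.1f} GB of weights, {verdict}")
```

Under these assumptions a 13B model in FP16 (about 26 GB of weights) exceeds a single 24 GB card, so it would have to be split across GPUs or quantized, while an 8B model in FP16 fits on one card with room left for the KV cache.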

media r/LocalLLaMA · 6d ago

Calibrating 2-bit GGUFs for agentic coding tasks

2-bit quantized versions of Qwopus3.6-27B-Coder, calibrated on real agentic coding logs, achieve a 63% pass rate on SWE-rebench. The IQ2_M quant outperforms non-calibrated versions and rivals Q5_K_M in pass rate despite being half the size, with improved robustness to loops and faster decoding due to a bundled MTP.

media Latent Space · 6d ago

Why AI Scaling Is a Systems Problem, Not Just a GPU Race

The AI scaling debate overlooks that maximizing model FLOP utilization (MFU) matters more than buying more GPUs. Frontier labs like xAI operate at sub-10% MFU, while historical models achieved 21% to 70% MFU, which points to systemic inefficiencies in scheduling, networking, and cluster management. Anjney Midha argues that AI infrastructure must evolve into efficient, aligned, and responsible systems, with 'output maxing' emerging as a new discipline for frontier AI.
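For readers unfamiliar with the metric, MFU is the ratio of the FLOPs a training run actually sustains to the hardware's theoretical peak. The sketch below uses the common approximation of about 6 FLOPs per parameter per training token, which ignores attention FLOPs. The function name `mfu` and all example numbers are hypothetical and are not taken from the article.

```python
def mfu(tokens_per_sec: float, n_params: float, n_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOP utilization: achieved training FLOP/s over peak FLOP/s.

    Uses the ~6 * N FLOPs-per-token approximation (forward plus backward),
    which is an assumption and undercounts attention cost at long context.
    """
    achieved = 6.0 * n_params * tokens_per_sec
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak


# Hypothetical cluster: 70B parameters, 1024 GPUs at 1e15 peak FLOP/s each,
# processing 1e6 training tokens per second in aggregate.
example = mfu(tokens_per_sec=1.0e6, n_params=70e9, n_gpus=1024,
              peak_flops_per_gpu=1.0e15)

if __name__ == "__main__":
    print(f"MFU = {example:.1%}")
```

Because the denominator grows with every GPU added, buying hardware without raising sustained throughput lowers MFU, which is the systems argument the article makes.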

media r/LocalLLaMA · 6d ago

LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M Released

LFM2.5-Embedding-350M is a dense bi-encoder that provides fast multilingual retrieval with one vector per document, achieving best-in-class accuracy for its size and inference speed comparable to smaller models. LFM2.5-ColBERT-350M is a late interaction retriever with best-in-class multilingual accuracy, enabling cross-lingual retrieval by storing one vector per token and supporting retrieval in multiple languages with high precision. Both models are designed as drop-in replacements for existing RAG pipelines.
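The difference between the two retrieval styles can be illustrated with a toy scorer. This is a minimal sketch assuming ColBERT-style MaxSim scoring for the late-interaction case. The announcement does not spell out the scoring function, and the function names and toy vectors below are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_score(query_vec, doc_vec):
    """Bi-encoder scoring: one vector per query, one per document."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_token_vecs, doc_token_vecs):
    """Late-interaction (MaxSim) scoring: for each query token, take the
    best-matching document token, then sum those maxima."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


# Toy 2-dimensional embeddings, purely illustrative.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

if __name__ == "__main__":
    print("dense:", dense_score([1.0, 0.0], [0.9, 0.1]))
    print("maxsim:", maxsim_score(query_tokens, doc_tokens))
```

The trade-off is storage and compute: the dense model keeps one vector per document, while the late-interaction model keeps one per token and compares every query token against every document token.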

media r/LocalLLaMA · 6d ago

Real-world token cost savings from rtk, headroom, and caveman

A real workload analysis shows headroom, rtk, and caveman reduce token costs by 2.8%, 0.5%, and 0.4% respectively, totaling 3.7% of baseline spending. Savings are limited by payload diversity: most traffic is plain text or source code, while the tools only compress structured outputs. Most of the cost reduction lands on the cheapest token stream, cache reads. The tools do not affect prompt caching or output costs, and coverage gaps remain, especially for rtk.

github llama.cpp · 6d ago

llama.cpp Release b9703: Updates and Binary Downloads

llama.cpp version b9703 includes a rework of the server's preset handling, removing remote HF preset support and deprecated functions. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

github llama.cpp · 6d ago

llama.cpp release b9704: fixes invalid grammar handling and adds new binaries

llama.cpp version b9704 now returns HTTP 400 for invalid grammar instead of silently dropping constraints. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware accelerators, with support for Vulkan, ROCm, OpenVINO, SYCL, and CUDA.

media r/LocalLLaMA · 6d ago

unsloth GLM-5.2-GGUF with 2-bit quantization at 238 GB

The unsloth GLM-5.2-GGUF model is available with 2-bit quantization at a size of 238 GB. It is hosted on Hugging Face and was shared in a Reddit post in the LocalLLaMA community.

media r/LocalLLaMA · 6d ago

NVFP4 KV cache quantization on SM120 will make 32 GB VRAM systems very capable

Qwen3.6-27B runs at ~60 tokens/sec on 32 GB of VRAM with FP8 KV cache quantization. NVFP4 KV cache quantization on SM120 could make such systems significantly more capable, though no implementation is available yet.
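To see why a smaller KV format matters on a 32 GB card, the cache footprint can be estimated from the model shape. This is a sketch under stated assumptions: the layer count, KV head count, head dimension, and context length are hypothetical placeholders, not the actual Qwen3.6-27B configuration, and NVFP4 is assumed to cost 4 bits per value plus one 8-bit scale per 16-element block.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    """KV cache size for one sequence: keys and values (factor 2) for every
    layer, KV head, head dimension, and cached token position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, NOT the real Qwen3.6-27B configuration.
SHAPE = dict(n_layers=64, n_kv_heads=8, head_dim=128, seq_len=131072)

# Bytes per cached element. The NVFP4 figure is an assumption:
# 4-bit values plus one 8-bit scale shared by each 16-element block.
FORMATS = {
    "FP16": 2.0,
    "FP8": 1.0,
    "NVFP4 (assumed)": 0.5 + 1.0 / 16.0,
}

if __name__ == "__main__":
    for name, nbytes in FORMATS.items():
        gib = kv_cache_bytes(bytes_per_elem=nbytes, **SHAPE) / 2**30
        print(f"{name}: {gib:.1f} GiB of KV cache at {SHAPE['seq_len']} tokens")
```

Under these assumptions the 4-bit format needs a little over half the cache memory of FP8, which leaves more of the 32 GB for weights or longer contexts.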

media r/LocalLLaMA · 6d ago

Llama Bench vs Real-World Performance Discrepancy

The user reports a significant gap between llama-bench results and actual model performance. Benchmarks show 754 tokens/s prefill and 36 tokens/s generation, but real usage delivers only 7.98 tokens/s, with high latency and poor throughput. The post attributes the discrepancy to real-world usage conditions rather than to benchmark settings.
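One way separate prefill and generation numbers can overstate perceived speed is that a real request pays for both phases. The sketch below is only an illustration of that effect, not the cause identified in the post. The function name and the prompt and output lengths are hypothetical, while the two rates are the benchmark figures quoted above.

```python
def end_to_end_tps(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Generated tokens per second of wall-clock time, counting the time
    spent processing the prompt before the first output token."""
    total_seconds = prompt_tokens / prefill_tps + output_tokens / decode_tps
    return output_tokens / total_seconds


PREFILL_TPS = 754.0  # benchmark prefill rate quoted in the post
DECODE_TPS = 36.0    # benchmark generation rate quoted in the post

if __name__ == "__main__":
    # Hypothetical request sizes: (prompt tokens, output tokens).
    for prompt, output in ((0, 200), (2000, 200), (8000, 200)):
        rate = end_to_end_tps(prompt, output, PREFILL_TPS, DECODE_TPS)
        print(f"prompt={prompt:5d} output={output}: {rate:.1f} tokens/s end to end")
```

With long prompts and short answers, the end-to-end rate falls well below the raw generation rate even when both benchmark numbers hold exactly.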

github llama.cpp · 6d ago

llama.cpp Release b9698 Adds Self-Update Support and Multiple Platform Binaries

llama.cpp version b9698 enables self-updates only when built with llama-install.sh. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

arxiv arXiv cs.LG · 7d ago

TransitNet Achieves 95.2% Accuracy in Low-SNR Transit Searches

TransitNet, a compact attention-augmented deep learning framework, achieves 95.2% accuracy in low-SNR transit blind searches, outperforming TLS and BLS in ROC-AUC and PR-AP values. It recovers 93.0% of injected Earth- and sub-Earth-size transits, with 97.4% of injected transits fully covered by estimated transit windows, and successfully recovers all 34 confirmed Kepler planets with a mean midpoint error of 1.24 hours.

arxiv arXiv cs.LG · 7d ago

EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts

EfficientRollout introduces a self-speculative decoding framework that reduces rollout and end-to-end latency by up to 19.6% and 12.7% respectively, without compromising final model quality. It uses a quantized drafter derived from the target model and integrates a system-aware toggle policy to avoid compute-bound regimes, enabling effective speculation during evolving policy generations.

arxiv arXiv cs.LG · 7d ago

FoMoE Breaks Full-Replica Barrier with Partitioned Expert Layers

FoMoE introduces a system that partitions expert layers across workers to avoid full model replicas, reducing communication costs by up to 1.42x over baselines and 45.44x over DDP. It achieves up to 1.4x throughput speedups via a skip-token mechanism and demonstrates stable routing, with projected benefits extending to 100B-scale models through system modeling.

arxiv arXiv cs.LG · 7d ago

CAHP: Complementary Attention Head Pruning for Efficient Transformers

CAHP introduces a post-hoc framework that uses graph-theoretical clustering and information-theoretic measures to select complementary attention heads in Transformers. It automatically determines head retention without predefined sparsity, identifying a performance degradation threshold to ensure minimal model loss, and outperforms baselines in high-compression scenarios by preserving functionally critical heads in intermediate layers.

arxiv arXiv cs.AI · 7d ago

SwitchBraidNet: Lightweight EEG Model for Hybrid BCIs

SwitchBraidNet is a quantisation-aware EEG classification architecture that achieves high accuracy in motor imagery and SSVEP tasks. It outperforms four baselines in FP16 and FP32, with MI accuracy of 69.49%, SSVEP accuracy of 93.48%, and a hybrid information transfer rate of 64.82 bits/min in FP16. The model runs efficiently with only 3.03 KB of INT8 storage, enabling low-power embedded deployment.

lab Claude Code Releases · 7d ago

Claude Code v2.1.181 Release Notes

Claude Code v2.1.181 adds support for changing settings via prompt syntax such as /config thinking=false, adds sandbox Apple Events support on macOS, and improves streaming, auto-retry, and subagent behavior. It also fixes numerous bugs related to startup, file handling, clipboard, and UI responsiveness across platforms.

github llama.cpp · 7d ago

ggml-cpu: Conditionally enable POWER11 backend based on compiler support

ggml-cpu now enables the POWER11 backend only when the compiler supports -mcpu=power11. This prevents build failures on current GCC/Clang toolchains while maintaining forward compatibility. CMakeLists.txt is updated to support the change, and -mcpu=power10 is used for both P10 and P11 architectures.