Inference efficiency — korshunov.ai

BITEMBED: Extreme Low-Bit Framework for LLM-Based Text Embeddings

The paper introduces BITEMBED, an extreme low-bit framework designed to address the high deployment costs of LLM-based text embedders by targeting both encoding efficiency and vector storage. The method converts pretrained LLM backbones into BitNet-style encoders featuring ternary weights, quantized activations, and lightweight normalization refinement. To adapt these models for representation learning, BITEMBED employs continual contrastive pre-training followed by supervised contrastive fine-tuning. This fine-tuning process utilizes similarity-distribution distillation and attention-relation distillation from a full-precision teacher model. Beyond backbone quantization, the framework trains output embeddings to support multiple storage precisions, allowing for flexible trade-offs between performance and storage costs. Experiments on the MMTEB benchmark using Qwen3-0.6B and Gemma3-270M demonstrate that BITEMBED performs largely comparably to full-precision teacher embedders.
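
As a rough illustration of what "ternary weights" means in a BitNet-style encoder, the sketch below applies the absmean rounding commonly associated with BitNet b1.58. This is an assumption: the summary above does not state BITEMBED's exact quantizer, and the function name `ternarize` and the sample matrix are hypothetical.

```python
def ternarize(weights, eps=1e-8):
    """Map a 2-D list of float weights to values in {-1, 0, 1} plus one scale.

    Assumed scheme (absmean): divide by the mean absolute weight,
    round to the nearest integer, then clip to [-1, 1].
    """
    magnitudes = [abs(w) for row in weights for w in row]
    scale = sum(magnitudes) / len(magnitudes) + eps
    quantized = [
        [max(-1, min(1, round(w / scale))) for w in row]
        for row in weights
    ]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate reconstruction of the original weights."""
    return [[q * scale for q in row] for row in quantized]


# Hypothetical 2x3 weight matrix, for illustration only.
sample = [[0.9, -0.05, -1.2], [0.4, 0.0, 2.0]]
q, s = ternarize(sample)
approx = dequantize(q, s)
```

Each weight then needs under two bits of storage plus one shared scale per matrix, which is where the encoding-cost savings would come from under this assumption.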

github llama.cpp · 1h ago

llama.cpp b9785 Release with Hardened Caps Check and Multi-Platform Binaries

The llama.cpp project has released version b9785, featuring a code change to harden caps checks as detailed in pull request #24973. This update provides pre-built binaries for macOS Apple Silicon, Intel Macs, and iOS via an XCFramework, with KleidiAI support disabled on Apple Silicon. Linux distributions including Ubuntu are supported for CPU, Vulkan, ROCm 7.2, OpenVINO, and SYCL backends across x64, arm64, and s390x architectures. Android users can access arm64 CPU binaries, while Windows offers extensive options covering CPU, OpenCL Adreno, CUDA 12 and 13, Vulkan, OpenVINO, SYCL, and HIP. The release also includes builds for openEuler targeting x86 and aarch64 processors with ACL Graph support. A standalone UI package is available alongside the platform-specific releases to facilitate local model inference.

media r/LocalLLaMA · 2h ago

Gemma4-26B-A4B & 31B-QAT Uncensored Balanced Released with MTP Speed Boosts

HauhauCS has released two new uncensored, balanced versions of the Gemma 4 models: Gemma4-26B-A4B and Gemma4-31B-QAT. Both variants incorporate Multi-Token Prediction (MTP) draft heads to enable speculative decoding. The 26B-A4B model gains approximately 35% in speed and the 31B model 53%, and output quality is reported as identical because drafted tokens are verified before they are accepted. These releases use QAT-aware quantization, and the author recommends Q4_K_M as the optimal format, since higher precision reportedly offers no quality gains for these specific models. The 26B-A4B is a Mixture of Experts architecture with roughly 4 billion active parameters per token, whereas the 31B variant is a dense model offering higher capability for users with sufficient VRAM. Both models include vision support via mmproj files and maintain a 262K context window. The author notes that GenRM testing produced zero refusals across 465 prompts, which they present as evidence of the models' uncensored behavior.
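
To show why speculative decoding can leave output unchanged, here is a minimal sketch of the greedy variant: a cheap draft proposes a few tokens, the full model checks them, and only matching tokens are kept. The toy `target` and `draft` functions are hypothetical stand-ins, not Gemma 4 or its MTP heads, and a real engine verifies all drafted positions in one batched forward pass rather than one call per token.

```python
def greedy_decode(target, prompt, n_new):
    """Baseline: the full model produces every token itself."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding with a draft window of k tokens."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target verifies the proposal position by position.
        accepted = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch
                break
        else:
            # Every drafted token matched: the target adds one more.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal]


def target(seq):
    # Hypothetical deterministic "full model" over a vocabulary of 11 tokens.
    return (3 * seq[-1] + len(seq)) % 11


def draft(seq):
    # Hypothetical draft: agrees with the target except at every 4th position.
    guess = target(seq)
    return (guess + 1) % 11 if len(seq) % 4 == 0 else guess


baseline = greedy_decode(target, [1, 2], 20)
speculative = speculative_decode(target, draft, [1, 2], 20)
```

In this sketch the speedup would come from accepting several tokens per verification step; the accepted sequence always equals what the target alone would have produced.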

media r/LocalLLaMA · 4h ago

GLM-5.2 on 4x DGX Spark: Reconstructing Missing Build Steps for MTP Speculative Decode

The author deployed GLM-5.2 with MTP speculative decode on a cluster of four NVIDIA GB10 (DGX Spark) nodes, achieving approximately 9.4 tokens per second. The setup uses vLLM with tensor parallelism, ported sparse-MLA Triton kernels, and a deterministic 15% expert pruning to fit AWQ-INT4 weights. A critical finding is that the original Docker image build instructions are incomplete, so the missing patches for deep_gemm.py and sparse_attn_indexer.py had to be reconstructed. The author also found that any vLLM version other than the specific pinned commit crashes with CUDA errors while loading real AWQ weights. To replicate the environment, users must apply a custom script that bakes in kernels and routes functions to sm12x fallbacks. The result is roughly double the speed of previous llama.cpp implementations, though inter-node bandwidth remains a bottleneck for dual-rail scaling.

media r/LocalLLaMA · 4h ago

Gefen: A Drop-in Replacement for AdamW with Claimed 8x Memory Reduction

Gefen is presented as a drop-in replacement for the AdamW optimizer, with a claimed eightfold reduction in memory usage during training. The code is in the GitHub repository ndvbd/Gefen, and the accompanying research paper is hosted on arXiv under the identifier 2606.13894. The post links to both for verification but gives no additional performance metrics or comparative benchmarks.

media Hugging Face Forums · 5h ago

Qwen3/Gemma3 Candle Skips Attention Masks for Equal-Length Batches in CPU Mode

A user has reported a critical bug in the Hugging Face text-embeddings-inference library affecting Qwen3 and Gemma3 models. The issue arises when running inference on CPUs with concurrent requests, leading to significant accuracy degradation. Specifically, the Candle backend incorrectly skips attention masks for batches where all input sequences have equal lengths. This defect compromises the reliability of embeddings generated under these specific conditions. To address the problem, the author submitted a pull request containing a fix that was thoroughly tested on their local machines. The bug highlights potential stability risks in CPU-based embedding services handling batched inputs.

github llama.cpp · 7h ago

llama.cpp Release b9784: Hexagon MM Optimizations and Cross-Platform Binaries

llama.cpp has released version b9784 with major optimizations for Hexagon MM operations, including 32x32 tiled weight repack, improved dyn.quant handling, and unified kernel parameter management. The release includes new binaries for macOS (arm64 and x64), iOS, and multiple Linux architectures with support for Vulkan, ROCm, and OpenVINO.

github llama.cpp · 9h ago

llama.cpp releases b9782 with new binaries and support

llama.cpp has released version b9782 with binaries for macOS, Linux, Android, Windows, and openEuler. The builds cover Vulkan, OpenVINO, SYCL, ROCm, and CUDA backends across multiple architectures and come with an updated UI. KleidiAI is listed as disabled.

lab Hugging Face Blog · 11h ago

NVIDIA NeMo AutoModel Speeds Up Transformer Fine-Tuning

NVIDIA's NeMo AutoModel enables faster fine-tuning of transformer models by automating model selection and optimization. It reduces development time and improves efficiency in training large language models on NVIDIA hardware.

media r/LocalLLaMA · 11h ago

OpenAI and Broadcom Unveil LLM-Optimized Inference Chip

Early testing shows the first-generation chip delivers significantly better performance per watt than current state-of-the-art solutions. Built from the ground up for current and future large language models, the chip expands OpenAI's full-stack platform and will be deployed at gigawatt scale with data center partners across multiple generations.

media r/LocalLLaMA · 12h ago

Big News for AMD Strix Halo+ Owners: NPU Now Usable

AMD's NPU is now fully usable, enabling hybrid AI models on Strix Halo+ devices. Users can leverage hybrid mode to combine NPU and iGPU performance, with tools like Lemonade and official documentation making early testing accessible. The community is now calling for MTP-supported hybrid models to further boost performance.

github llama.cpp · 12h ago

llama.cpp releases b9781 with Vulkan and multi-platform support

llama.cpp releases version b9781, adding Vulkan support for Linux and Windows, and expanding to multiple architectures including ARM64 and x64 across macOS, Linux, Android, and Windows. The release includes CPU, CUDA, OpenVINO, SYCL, and ROCm builds, with a UI component available.

media r/LocalLLaMA · 13h ago

Model hacks boost GLM-5.2 speed from 2.5 to over 50 tok/s

A user achieved over 50 tokens per second for GLM-5.2 on their GH200 system by combining the MTP head from zai's FP8 repo with CyanKiwi's AWQ-INT4 quantized model. This hybrid approach, implemented via a merge script and patched vLLM, reached a best case of ~55 tok/s at 4x concurrency and ~45 tok/s for single-request inference, with streaming from RAM to VRAM.

lab OpenAI News · 14h ago

OpenAI and Broadcom unveil LLM-optimized inference chip

OpenAI and Broadcom have introduced Jalapeño, a custom AI chip designed for large language model inference. The chip aims to enhance performance, efficiency, and scalability in AI systems.

media r/LocalLLaMA · 15h ago

Gemma 4 26B-A4B Surprisingly Usable at IQ3_S

A user reports that Gemma 4 26B-A4B quantized to IQ3_S runs at 25 tokens per second on a MacBook Air, performing nearly as well as bf16 for non-coding, tool-calling tasks. They ask whether this impression reflects confirmation bias or whether such low-bit quantized models are genuinely usable.

media r/LocalLLaMA · 15h ago

What tools do people use to estimate VRAM and RAM for local LLMs?

Users share that hf-accelerate's model-memory-usage and NyxKrage's LLM VRAM Calculator are common tools for estimating VRAM and RAM needs. The NyxKrage tool is noted for being KV-cache-aware and configurable with quantization and context length settings, though results may vary across models and engines like llama.cpp or vLLM due to quantization and caching behaviors.
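
For a sense of what such calculators compute, the sketch below does the usual back-of-the-envelope estimate: weight bytes from parameter count and bits per weight, plus KV-cache bytes from layer count, KV heads, head size, and context length. It is not the code of either tool named above; the function name and the example configuration are hypothetical, and activations, runtime buffers, and engine-specific overhead are ignored.

```python
GIB = 1024 ** 3


def estimate_memory_gib(n_params, weight_bits, n_layers, n_kv_heads,
                        head_dim, context_len, kv_bytes_per_elem=2):
    """Return (weights_gib, kv_cache_gib) for a single sequence.

    weights  = parameters * bits per weight / 8
    kv cache = 2 (keys and values) * layers * kv heads * head dim
               * context length * bytes per element
    """
    weight_bytes = n_params * weight_bits / 8
    kv_bytes = (2 * n_layers * n_kv_heads * head_dim
                * context_len * kv_bytes_per_elem)
    return weight_bytes / GIB, kv_bytes / GIB


# Hypothetical 7B-parameter model at ~4.5 bits per weight, fp16 KV cache.
weights_gib, kv_gib = estimate_memory_gib(
    n_params=7e9, weight_bits=4.5,
    n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192,
)
total_gib = weights_gib + kv_gib
```

The KV-cache term grows linearly with context length, which is one reason a calculator that ignores it can underestimate memory at long contexts.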

media r/LocalLLaMA · 17h ago

llama.cpp updates: Granite-Speech, LFM2.5-ColBERT models, Vulkan backend enhancements

llama.cpp now supports granite-speech-4.1-2b-plus and LFM2.5-ColBERT/Embedding-350M models. Vulkan backend updates include support for 3D convolutions, aligned operations, GET_ROWS_BACK, and improved numerical stability in feedforward layers. Additional improvements cover UI enhancements and backend test coverage.

arxiv arXiv cs.LG · 17h ago

Reservoir Computing for Feature-Free Audio Signal Processing

This paper explores Reservoir Computing as a feature-free method for raw audio signal classification. It shows that parallel deep reservoir architectures outperform shallow and sequential ones in accuracy while maintaining low complexity, enabling efficient, low-power audio processing with minimal preprocessing.
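
A minimal sketch of the general idea, assuming a plain echo-state-style update: raw samples drive fixed random recurrent networks, and only a readout on the resulting states would be trained. Reading "parallel" as several independent reservoirs fed the same signal with their states concatenated is an assumption, as are all names, sizes, and weight scales below; the paper's actual architecture is not described in the summary.

```python
import math
import random


def make_reservoir(n_units, seed, recurrent_scale=0.5):
    """Fixed (untrained) random input and recurrent weights."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1, 1) for _ in range(n_units)]
    w = [[rng.uniform(-1, 1) * recurrent_scale / n_units
          for _ in range(n_units)] for _ in range(n_units)]
    return w_in, w


def run_reservoir(reservoir, signal):
    """Feed raw samples one by one; return the final reservoir state."""
    w_in, w = reservoir
    n = len(w_in)
    state = [0.0] * n
    for u in signal:
        state = [
            math.tanh(w_in[i] * u + sum(w[i][j] * state[j] for j in range(n)))
            for i in range(n)
        ]
    return state


def parallel_features(reservoirs, signal):
    """Concatenate the final states of independent reservoirs."""
    features = []
    for reservoir in reservoirs:
        features.extend(run_reservoir(reservoir, signal))
    return features


# Hypothetical raw audio snippet: a short sine wave, no hand-crafted features.
signal = [math.sin(0.3 * t) for t in range(50)]
reservoirs = [make_reservoir(8, seed) for seed in (1, 2, 3)]
features = parallel_features(reservoirs, signal)
```

A linear classifier trained on `features` would be the only learned component in this setup, which is what keeps the training cost low.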

github llama.cpp · 18h ago

llama.cpp Release b9777 Adds New Models and Cross-Platform Binaries

llama.cpp release b9777 adds support for the LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M models. The release includes pre-built binaries for macOS, Linux, Android, Windows, and openEuler, supporting various architectures and acceleration technologies like CUDA, Vulkan, OpenVINO, and SYCL.

arxiv arXiv cs.LG · 18h ago

Fast-TurboQuant: Multiplier-Free Vector Quantization

Fast-TurboQuant introduces a multiplier-free projection method using a structured fast Johnson-Lindenstrauss transform. It replaces dense random rotation matrices with Rademacher phase inversion and fast Walsh-Hadamard transform, reducing arithmetic to only additions and improving Recall@10 with lower mean squared error.
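
The two building blocks named above can be sketched in a few lines: random Rademacher sign flips followed by an in-place fast Walsh-Hadamard transform, both of which need only negation, addition, and subtraction. This is a generic illustration, not the paper's implementation; the function names are hypothetical, and the transform is left unnormalized because the usual 1/sqrt(n) factor would reintroduce a multiplication.

```python
import random


def rademacher_signs(n, seed=0):
    """Random +1/-1 pattern, fixed by the seed."""
    rng = random.Random(seed)
    return [rng.choice((1, -1)) for _ in range(n)]


def flip_signs(x, signs):
    """Apply the sign pattern by negation only (no multiplication)."""
    return [v if s > 0 else -v for v, s in zip(x, signs)]


def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform in O(n log n) add/subtract steps."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly
        h *= 2
    return out


def project(x, signs):
    """Structured stand-in for a dense random rotation: sign flip, then FWHT."""
    return fwht(flip_signs(x, signs))


vec = [3, -1, 4, 1, -5, 9, 2, -6]
signs = rademacher_signs(len(vec), seed=42)
projected = project(vec, signs)
```

Because the unnormalized transform scales squared norms by exactly n, relative distances between vectors are preserved, which is the property a dense rotation would otherwise provide.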