Open weights
media r/LocalLLaMA · 2h ago

Gemma4-26B-A4B & 31B-QAT Uncensored Balanced Released with MTP Speed Boosts

HauhauCS has released uncensored, balanced versions of two Gemma 4 models: Gemma4-26B-A4B and Gemma4-31B-QAT. Both include Multi-Token Prediction (MTP) draft heads for speculative decoding, which speeds up inference by about 35% on the 26B-A4B and 53% on the 31B. Output quality is reported as identical, since drafted tokens are verified before they are accepted. Because the releases rely on quantization-aware training (QAT), the author recommends Q4_K_M as the optimal format; higher-precision quants reportedly offer no quality gain for these specific models. The 26B-A4B is a Mixture of Experts model with roughly 4 billion active parameters per token, whereas the 31B is a dense model offering higher capability for users with sufficient VRAM. Both models support vision via mmproj files and keep a 262K context window. The author reports zero refusals across 465 prompts in GenRM testing and presents this as confirmation that the models are uncensored.
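To illustrate why speculative decoding can leave the output unchanged, here is a minimal toy sketch in Python. It is not the release's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the main model and the MTP draft heads, and the sketch covers greedy decoding only (sampling needs a rejection-sampling variant). The draft proposes `k` tokens, the target checks them, and only the target's own tokens are ever emitted.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical toy 'main model': a deterministic next-token rule."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Hypothetical toy draft: agrees with the target except at every 4th position."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n: int, k: int = 3
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n:
        rounds += 1
        # 1) The draft proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target verifies the proposal. A real engine does this in one
        #    batched forward pass; here it is simulated token by token.
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always emit the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every drafted token matched, so the pass yields one bonus token.
            out.append(target(out))
    return out[len(prompt):][:n], rounds
```

In this toy setup each round stands for one pass of the main model and emits between 1 and `k + 1` tokens, so the speed-up depends on how often the draft agrees with the target.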

media r/LocalLLaMA · 4h ago

GLM-5.2 on 4x DGX Spark: Reconstructing Missing Build Steps for MTP Speculative Decode

The author deployed GLM-5.2 with MTP speculative decode on a cluster of four NVIDIA GB10 (DGX Spark) nodes, reaching approximately 9.4 tokens per second. The setup uses vLLM with tensor parallelism, ported sparse-MLA Triton kernels, and a deterministic 15% expert pruning to fit the AWQ-INT4 weights. A critical finding is that the original Docker image build instructions are incomplete: the missing patches for deep_gemm.py and sparse_attn_indexer.py had to be reconstructed. The author also found that any vLLM version other than the specific pinned commit crashes with CUDA errors while loading real AWQ weights. To replicate the environment, users must apply a custom script that bakes in the kernels and routes functions to sm12x fallbacks. The result is roughly double the speed of previous llama.cpp implementations, though inter-node bandwidth remains a bottleneck for dual-rail scaling.

arxiv arXiv cs.CL · 23h ago

ComputeFHE: A Privacy-Preserving General-Purpose Computation Library

ComputeFHE is an open-source C++ library for privacy-preserving computation built on the TFHE cryptosystem. It offers encrypted integer and fixed-point data types with arithmetic and logical operations, and supports both standard and optimized FHE-friendly ALU architectures. Experimental results show up to 3.9x performance improvements and fewer bootstrapping operations. A simulation mode allows testing and complexity analysis without cryptographic execution.

arxiv arXiv cs.CL · 1d ago

African Language Tokenization Penalty in Frontier LLMs

African languages face a tokenization premium of 1.88x to 8.92x compared to English in frontier LLMs, with Ethiopic and N'Ko scripts bearing the highest costs. This penalty translates to inference costs up to 8.9x those of English and to reduced context capacity, with some languages getting as little as 11% of English's effective context window. The penalty persists across corpora and is not eliminated by current tokenizers, highlighting a structural digital divide.
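The link between the premium and the 11% figure is simple arithmetic, sketched below in Python. The sketch assumes that cost scales linearly with token count and that a language with premium p fits about 1/p as much text into the same window. The function names and the 128,000-token window are hypothetical and not taken from the paper.

```python
def effective_context_fraction(premium: float) -> float:
    """Share of English's effective context left when text needs `premium` times the tokens."""
    return 1.0 / premium


def relative_cost(premium: float) -> float:
    """Cost multiplier versus English, assuming strictly per-token pricing."""
    return premium


WINDOW_TOKENS = 128_000  # hypothetical context window, for illustration only

for premium in (1.88, 8.92):  # the lowest and highest premiums in the summary
    fraction = effective_context_fraction(premium)
    english_equivalent = int(WINDOW_TOKENS * fraction)
    print(
        f"premium {premium:.2f}x: cost x{relative_cost(premium):.2f}, "
        f"effective context {fraction:.0%} "
        f"(about {english_equivalent} English-equivalent tokens)"
    )
```

At the top premium of 8.92x, 1/8.92 is roughly 0.11, which matches the 11% effective context figure; at 1.88x the same calculation gives roughly 53%.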