Open weights — korshunov.ai


Gemma4-26B-A4B & 31B-QAT Uncensored Balanced Released with MTP Speed Boosts

HauhauCS has released two new uncensored, balanced versions of the Gemma 4 models: Gemma4-26B-A4B and Gemma4-31B-QAT. Both variants incorporate Multi-Token Prediction (MTP) draft heads for speculative decoding, which speeds up inference considerably: the 26B-A4B model runs approximately 35% faster and the 31B model 53% faster. Output is reported as identical, since speculative decoding keeps only the drafted tokens that the main model verifies. The releases use QAT-aware quantization, so Q4_K_M is the recommended format: higher-precision quantizations offer no quality gain for these specific models. The 26B-A4B is a Mixture of Experts architecture with roughly 4 billion active parameters per token, whereas the 31B variant is a dense model offering higher capability for users with sufficient VRAM. Both models include vision support via mmproj files and keep a 262K context window. The author reports zero refusals across 465 prompts in GenRM testing, which supports the uncensored claim.
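
To illustrate why speculative decoding can leave output unchanged, here is a minimal sketch of the greedy variant. It is not the code used by these releases or by any particular inference engine; `toy_target` and `toy_draft` are hypothetical stand-ins for the main model and the MTP draft head, and real engines verify all drafted positions in one batched forward pass (and use a rejection-sampling rule when sampling rather than decoding greedily).

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the next token id


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Draft k tokens, then keep only what the target model agrees with."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    drafted: List[int] = []
    ctx = list(context)
    for _ in range(k):
        token = draft(ctx)
        drafted.append(token)
        ctx.append(token)

    # 2) The target model checks each proposal in order (a plain loop here).
    accepted: List[int] = []
    ctx = list(context)
    for token in drafted:
        expected = target(ctx)
        if token != expected:
            accepted.append(expected)  # first mismatch: use the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)

    accepted.append(target(ctx))  # every draft accepted: one extra target token
    return accepted


def generate_speculative(target: NextToken, draft: NextToken,
                         prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[len(prompt):len(prompt) + n_tokens]


def generate_greedy(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out[len(prompt):]


# Hypothetical toy "models": deterministic functions, not real networks.
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + 7) % 11


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except when the context length is a multiple of 3.
    return toy_target(ctx) if len(ctx) % 3 else 0
```

Every emitted token is one the target model would have chosen for that prefix, so `generate_speculative` and `generate_greedy` return the same sequence; the speed-up comes from accepting several drafted tokens per verification pass.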

media r/LocalLLaMA · 4h ago

GLM-5.2 on 4x DGX Spark: Reconstructing Missing Build Steps for MTP Speculative Decode

The author deployed GLM-5.2 with MTP speculative decode on a cluster of four NVIDIA GB10 (DGX Spark) nodes, reaching approximately 9.4 tokens per second. The setup uses vLLM with tensor parallelism, ported sparse-MLA Triton kernels, and a deterministic 15% expert pruning so that the AWQ-INT4 weights fit. A critical finding is that the original Docker image build instructions are incomplete: the missing patches for deep_gemm.py and sparse_attn_indexer.py had to be reconstructed. The author also found that any vLLM version other than the specific pinned commit crashes with CUDA errors while loading real AWQ weights. To replicate the environment, users must apply a custom script that bakes in the kernels and routes functions to sm12x fallbacks. The result is roughly double the speed of previous llama.cpp implementations, though inter-node bandwidth remains a bottleneck for dual-rail scaling.

media r/LocalLLaMA · 8h ago

SDXL Running Locally in Browser on WebGPU, Open-Source

A browser extension enables local image generation with SDXL models via WebGPU, running on the user's own GPU with no external setup. The tool supports two models: SDXL-Lightning fp16 (7 GB) and a 4-bit version (3.6 GB). Requirements include at least 8 GB of VRAM for the full model and a browser with WebGPU support (Chrome/Edge 122+ or the latest Firefox).

github llama.cpp · 9h ago

llama.cpp releases b9782 with new binaries and support

llama.cpp releases version b9782, including binaries for macOS, Linux, Android, Windows, and openEuler. The release adds support for Vulkan, OpenVINO, SYCL, ROCm, and CUDA across multiple architectures, with updated UI and disabled features such as KleidiAI and openEuler support.

media r/LocalLLaMA · 10h ago

Sipp: Open-source library for in-browser inference built on llama.cpp

Sipp is an open-source library that enables in-browser inference using llama.cpp. It allows users to run local language model inference directly in web browsers without relying on cloud services. The project is available on GitHub at https://github.com/noumena-labs/Sipp.

arxiv arXiv cs.AI · 11h ago

SciVerseGym: Reinforcement Learning Environment for Crystal Discovery

SciVerseGym introduces a Gymnasium-compatible environment that frames crystal discovery as a Markov decision process. It enables agents to perform chemically meaningful edits on atomic structures and receive feedback from configurable evaluators, supporting diverse actions and observation types with machine-learned potentials or ASE-compatible calculators.
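
As a rough illustration of what "crystal discovery as a Markov decision process" can look like behind a Gymnasium-style interface, here is a toy sketch. It is not SciVerseGym's API: the class `ToyCrystalEnv`, the substitution action, and `toy_evaluator` are all hypothetical, and the code only mirrors the `reset`/`step` return convention of Gymnasium without importing it. A real evaluator would be a machine-learned potential or an ASE-compatible calculator.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Structure = List[str]  # toy stand-in: one element symbol per lattice site


def toy_evaluator(structure: Structure) -> float:
    """Hypothetical score: number of neighbouring sites holding different elements."""
    return float(sum(a != b for a, b in zip(structure, structure[1:])))


class ToyCrystalEnv:
    """Gymnasium-style environment whose actions are single-site substitutions."""

    def __init__(self, evaluator: Callable[[Structure], float],
                 elements: Tuple[str, ...] = ("Na", "Cl", "K", "Br"),
                 n_sites: int = 4, max_steps: int = 10) -> None:
        self.evaluator = evaluator  # configurable, as in the summary above
        self.elements = elements
        self.n_sites = n_sites
        self.max_steps = max_steps
        self._structure: Structure = []
        self._score = 0.0
        self._steps = 0

    def reset(self, seed: Optional[int] = None) -> Tuple[Structure, Dict[str, float]]:
        rng = random.Random(seed)
        self._structure = [rng.choice(self.elements) for _ in range(self.n_sites)]
        self._score = self.evaluator(self._structure)
        self._steps = 0
        return list(self._structure), {"score": self._score}

    def step(self, action: Tuple[int, int]):
        site, element_index = action  # edit: put elements[element_index] at site
        self._structure[site] = self.elements[element_index]
        new_score = self.evaluator(self._structure)
        reward = new_score - self._score  # reward is the change in evaluator score
        self._score = new_score
        self._steps += 1
        terminated = False  # this toy task has no terminal state
        truncated = self._steps >= self.max_steps
        return list(self._structure), reward, terminated, truncated, {"score": new_score}
```

An agent would call `env.reset(seed=...)` once and then `env.step((site, element_index))` repeatedly, reading the reward to learn which edits improve the score.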

media r/LocalLLaMA · 11h ago

Build an LLM from Scratch using MLX

A developer created a Nano LLM with 20.2M parameters on a MacBook Air using the MLX framework. The project demonstrates that building a large language model from scratch is feasible with minimal hardware and basic Python knowledge.

github llama.cpp · 12h ago

llama.cpp releases b9781 with Vulkan and multi-platform support

llama.cpp releases version b9781, adding Vulkan support for Linux and Windows, and expanding to multiple architectures including ARM64 and x64 across macOS, Linux, Android, and Windows. The release includes CPU, CUDA, OpenVINO, SYCL, and ROCm builds, with a UI component available.

media r/LocalLLaMA · 13h ago

Model hacks boost GLM-5.2 speed from 2.5 to over 50 tok/s

A user achieved over 50 tokens per second with GLM-5.2 on a GH200 system by combining the MTP head from zai's FP8 repo with CyanKiwi's AWQ-INT4 quantized model. This hybrid approach, implemented via a merge script and a patched vLLM, reached a best case of ~55 tok/s at 4x concurrency and ~45 tok/s for single inference, with streaming from RAM to VRAM.

media Hugging Face Forums · 13h ago

Aiden Mobile Agent Prototype in the Making

Aiden is a physical AI agent device that monitors a phone's screen via HDMI and controls it through USB HID, enabling app automation without jailbreak or installed software. It supports bring-your-own LLMs, operates without backend infrastructure or data collection, and is released under the AGPL license as an open-source development board.

media r/LocalLLaMA · 14h ago

Nex-N2-Mini-Ultra-Uncensored-Heretic Model Released

The Nex-N2-Mini-Ultra-Uncensored-Heretic model is now available on Hugging Face in both Safetensors and GGUF formats. It features agentic thinking and is reported at 5/100 refusals with a KL divergence (KLD) of 0.0020. The creator notes that Heretic 1.2.0 was chosen over 1.4.0 because it did better at keeping both KLD and refusals low.

media r/LocalLLaMA · 15h ago

What tools do people use to estimate VRAM and RAM for local LLMs?

Users share that hf-accelerate's model-memory-usage and NyxKrage's LLM VRAM Calculator are common tools for estimating VRAM and RAM needs. The NyxKrage tool is noted for being KV-cache-aware and configurable with quantization and context length settings, though results may vary across models and engines like llama.cpp or vLLM due to quantization and caching behaviors.
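
For a sense of what such calculators compute, here is a back-of-the-envelope sketch. It is not the code of either tool mentioned above; the function names and the example model shape (7B parameters, 32 layers, 8 KV heads, head dimension 128) are hypothetical, and the estimate ignores activations, runtime buffers, and engine-specific overhead, which is one reason real numbers differ between llama.cpp and vLLM.

```python
GIB = 2 ** 30  # bytes per GiB


def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone at a given average quantization width."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: one K and one V tensor per layer, per token (fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GIB


def total_gib(n_params: float, bits_per_weight: float, n_layers: int,
              n_kv_heads: int, head_dim: int, context_len: int) -> float:
    """Weights plus KV cache; a lower bound, since overhead is not included."""
    return (weights_gib(n_params, bits_per_weight)
            + kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len))


# Hypothetical 7B model at ~4.5 bits per weight with an 8192-token context.
example = total_gib(7e9, 4.5, n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192)
print(f"{example:.2f} GiB")  # roughly 3.67 GiB of weights + 1.00 GiB of KV cache
```

The KV-cache term grows linearly with context length, which is why a KV-cache-aware calculator matters for long-context use.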

media r/LocalLLaMA · 17h ago

llama.cpp updates: Granite-Speech, LFM2.5-ColBERT models, Vulkan backend enhancements

llama.cpp now supports granite-speech-4.1-2b-plus and LFM2.5-ColBERT/Embedding-350M models. Vulkan backend updates include support for 3D convolutions, aligned operations, GET_ROWS_BACK, and improved numerical stability in feedforward layers. Additional improvements cover UI enhancements and backend test coverage.

github llama.cpp · 18h ago

llama.cpp release b9777 adds new models and cross-platform binaries

llama.cpp release b9777 adds the LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M models. The release includes pre-built binaries for macOS, Linux, Android, Windows, and openEuler, supporting various architectures and acceleration backends such as CUDA, Vulkan, OpenVINO, and SYCL.

media r/LocalLLaMA · 19h ago

New EU AI Model Domyn Will Be 400B Parameters

A startup has announced Domyn, a new EU AI model that will scale to 400 billion parameters. The company has already developed a closed 260B-parameter Domyn Large model for enterprise use and an open 10B model available on Hugging Face.

github llama.cpp · 23h ago

llama.cpp release b9776 adds Vulkan and multiple hardware support

llama.cpp version b9776 introduces Vulkan support for Linux and Windows, along with CPU, OpenCL, CUDA, and SYCL variants across macOS, Linux, Android, and Windows. The release also includes support for OpenVINO and ROCm, with UI available in a standalone package.

arxiv arXiv cs.CL · 23h ago

POS Tagging of Arabic-English Dictionary Senses via WordNet

The paper presents an algorithm that transfers English part-of-speech tags from Princeton WordNet to Arabic-English dictionary senses after disambiguation. This enables linking bilingual dictionaries to WordNet and standardizing them into WordNet-LMF format, where synsets are the fundamental unit, with high accuracy at low cost.

arxiv arXiv cs.CL · 23h ago

ComputeFHE: A Privacy-Preserving General-Purpose Computation Library

ComputeFHE is an open-source C++ library that enables privacy-preserving computation using the TFHE cryptosystem. It offers encrypted integer and fixed-point data types with arithmetic and logical operations, supporting both standard and optimized FHE-friendly ALU architectures. Experimental results show up to 3.9x performance improvements and reduced bootstrapping operations, with a simulation mode for testing and complexity analysis without cryptographic execution.

arxiv arXiv cs.CL · 1d ago

African Language Tokenization Penalty in Frontier LLMs

African languages face a tokenization premium of 1.88x to 8.92x compared to English in frontier LLMs, with Ethiopic and N'Ko scripts bearing the highest costs. This penalty translates to up to 8.9x higher inference costs and reduced context capacity, with some languages receiving as little as 11% of English's effective context window. The penalty persists across corpora and is not eliminated by current tokenizers, highlighting a structural digital divide.
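
The link between the premium and the context figure is simple arithmetic, sketched below. This assumes the premium is defined as the ratio of tokens needed for the same content in a given language versus English; the function names are illustrative and not taken from the paper.

```python
def tokenization_premium(tokens_language: int, tokens_english: int) -> float:
    """Tokens needed for the same content, relative to English (assumed definition)."""
    return tokens_language / tokens_english


def effective_context_fraction(premium: float) -> float:
    """Share of English's effective context left when each unit of text costs more tokens."""
    return 1.0 / premium


def relative_cost(premium: float) -> float:
    """With per-token pricing, cost scales directly with the premium."""
    return premium


for premium in (1.88, 8.92):  # the range reported in the summary above
    fraction = effective_context_fraction(premium)
    print(f"premium {premium}x -> {fraction:.0%} of English context, "
          f"{relative_cost(premium)}x cost")
```

At the top of the range, 1 / 8.92 is about 0.11, which matches the 11% effective-context figure quoted above; at the bottom, 1 / 1.88 is about 53%.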

arxiv arXiv cs.CL · 1d ago

RaDaR: AI Model Improves Rare Disease Diagnosis

RaDaR, a compact reasoning large language model, outperformed other open-source models in rare disease diagnosis. In a randomized trial, RaDaR improved physicians' diagnostic accuracy by 21.44 percentage points over internet search alone.