All articles — korshunov.ai


Stream-Stall Hint Updated in v2.1.185

The stream-stall hint now displays "Waiting for API response · will retry in …" and activates after 20 seconds of silence, replacing the previous message and delay.

media r/LocalLLaMA · 11d ago

Six Months Ago I Turned Down $8,165 for an RTX 6000 PRO

A Reddit user shared that six months ago they declined an $8,165 offer for an RTX 6000 PRO GPU. The same vendor now lists the same GPU for $11,575, prompting the user to reflect on the decision in hindsight.

media r/LocalLLaMA · 11d ago

GLM 5.2 Local Inference Speeds Report

Users report local GLM 5.2 inference speeds with llama.cpp on 6x RTX 3090, 128GB DDR5, and an i7-13700K: generation runs at 7.8 tokens/sec at 90K context with Q8_0 quantization, and prompt processing at approximately 40 tokens/sec.

github llama.cpp · 11d ago

llama.cpp Release b9741 Adds New Binaries and Support

llama.cpp version b9741 introduces new binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes support for Vulkan, CUDA 12.4 and 13.3, OpenVINO, SYCL, and ROCm, with updated versions for iOS and Ubuntu.

media r/LocalLLaMA · 11d ago

Free 15-Part Series on LLM Internals Grounded in Gemma 4 12B

The author wrote a free 15-part series detailing LLM internals, using Gemma 4 12B as the core example. Each part covers technical aspects from tokenization to serving, with real math, tensor shapes, and hardware constraints. The series includes a companion vLLM Deep Dive and is fully accessible without paywalls or email sign-up.

media r/LocalLLaMA · 11d ago

Qwen Code Companion Extension Now Open-Sourced

The Qwen Code Companion extension for VS Code is now available on the marketplace and has been open-sourced at https://github.com/QwenLM/qwen-code. The user reports that it performs well with LM Studio-hosted models and outperforms other local LLM tools such as Continue, Kilo, Cline, and Roo, with minimal configuration needed.

media r/LocalLLaMA · 11d ago

Gemma 4 26b a4b excels in language and scientific queries

A user claims Gemma 4 26b a4b is the best model they've tried for language learning and scientific queries, outperforming Qwen 3.5/3.6 in these domains. The post highlights a gap in available small MoE models between 20b and 30b, suggesting a need for more options beyond coding and agentic tasks.

media r/LocalLLaMA · 11d ago

Struggling to finish Xiaomi Mimo-v2.5-pro token plan credits before expiry

A user has 24B token credits from a Xiaomi token plan competition, worth $50 but obtained for free. They report heavy token consumption and limited tool support, and are now concerned about wasting the credits, which expire in four days. The user praises the model for its 90% cache hit rate and 99% price reduction on cache hits, and notes that it performs well in coding and planning tasks.

github llama.cpp · 11d ago

Fix for test-args-parser random failures on Windows

A patch addresses random failures in the test-args-parser on Windows by modifying argv override to only apply when argc matches, preventing clobbering of programmatic arguments. This fixes a fastfail assertion in the OpenVINO Windows workflow while preserving UTF-8 handling for real binaries.

media r/LocalLLaMA · 11d ago

Board where every tile is an agent

A project called Jaz introduces a board where each tile functions as an independent agent responsible for maintaining its own state. The system is open source and available on GitHub, with a live demo at jaz.chat, requiring a coding agent like Claude Code or Codex to operate.

media r/LocalLLaMA · 11d ago

Deep Neural Network Turns Images into Playable Games Locally

A locally running deep neural network can turn any image into a playable game, using a small Transformer-like model trained from scratch. The model, running on an RTX 5090, generates game sequences autoregressively with real-time keyboard input, though it currently suffers from poor motion and context issues.

media r/LocalLLaMA · 11d ago

R9700 Purchase Decision Amid GPU Price Surge

A user who bought two R9700 cards expresses frustration over Nvidia's pricing, citing current prices of $7,000 for the RTX 5090 and $13,500 for the RTX 6000 Pro. They question whether the R9700 was a mistake given the significant price increases of newer Nvidia GPUs.

media r/LocalLLaMA · 11d ago

Advice? 2x 3090

A user asks for advice on using two NVIDIA RTX 3090 GPUs. The post includes an image and links to the original Reddit submission and comments.

media r/LocalLLaMA · 11d ago

You can now convert EXL3 quants on Apple Silicon Mac

Users can now convert and run EXL3 quantized models on Apple Silicon Macs with 64GB+ RAM. Tests show that models like MiniCPM5 and Qwen3.6-27B achieve performance on par with or slightly behind RTX-card-based conversions, with EXL3 offering superior quantization quality compared to MLX.

github llama.cpp · 11d ago

llama.cpp Release b9739 Adds Windows ARM64 OpenCL Adreno Support

llama.cpp version b9739 adds support for Windows ARM64 using OpenCL Adreno. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and APIs, including Vulkan, CUDA, OpenVINO, and SYCL.

media r/LocalLLaMA · 11d ago

Napkin math on collective hosting costs for diffusiongemma in 2026

A cost analysis estimates that hosting diffusiongemma at different user token levels results in monthly costs per user ranging from 1.7€ to 122.8€. The study finds agentic AI usage is economically unsustainable for collective hosting, though costs could decrease with new GPUs or ASICs and a shorter GPU depreciation period.

media r/LocalLLaMA · 11d ago

Two Word Docs Chatting via Local LLMs — Real Use Cases?

A prototype demonstrates two Word documents exchanging content using local LLMs, with iterative back-and-forth over multiple turns. Potential practical use cases include a draft document and critic document iterating together, or a specification document and implementation document collaborating, though the viability of such workflows remains uncertain.

github llama.cpp · 11d ago

llama.cpp release b9738: fixes CORS auth header forwarding and new binary builds

llama.cpp version b9738 fixes the CORS proxy to avoid forwarding authentication headers. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

media r/LocalLLaMA · 11d ago

Any opinion about Qwen3.6-27B@BF16 vs Step3.7@IQ4_XS?

The user asks which model—Qwen3.6-27B at BF16 precision or Step3.7 with IQ4_XS quantization—would make saner, more autonomous decisions with less need for human guidance. The query compares a dense, high-precision model with a larger, lower-precision MoE model, noting trade-offs in memory and performance.

media r/LocalLLaMA · 11d ago

z.AI praises the number 1 open source model

z.AI, ranked number 2, has publicly praised the number 1 open source model. The post highlights admiration for the model's capabilities, emphasizing its performance and contributions to the community.