Stream-Stall Hint Updated in v2.1.185
The stream-stall hint now displays "Waiting for API response · will retry in …" and activates after 20 seconds of silence, replacing the previous message and delay.
A Reddit user shared that six months ago they declined an $8,165 offer for an RTX 6000 PRO GPU. The same vendor now lists the same GPU for $11,575, prompting the user to reflect on the decision in hindsight.
Users report that local GLM 5.2 inference with llama.cpp on 6x RTX 3090, 128GB DDR5, and an i7-13700K reaches 7.8 tokens/sec at 90K context with Q8_0 quantization. Prompt processing runs at approximately 40 tokens/sec.
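To put the prompt-processing figure in perspective, here is a rough back-of-the-envelope sketch. It assumes "90K" means 90,000 tokens, that the whole context is processed cold (no prompt cache), and that the 40 tokens/sec rate stays flat; none of this is stated in the post, and the function name is hypothetical.

```python
def prompt_processing_seconds(context_tokens: int, tokens_per_second: float) -> float:
    """Estimate the time to ingest a prompt at a constant processing rate."""
    return context_tokens / tokens_per_second


# Assumed inputs: 90,000 prompt tokens at the reported ~40 tokens/sec.
seconds = prompt_processing_seconds(90_000, 40.0)
minutes = seconds / 60
print(f"{seconds:.0f} s (about {minutes:.1f} min)")
```

Under those assumptions a full 90K-token prompt would take about 2,250 seconds, roughly 37.5 minutes, before generation starts.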
llama.cpp version b9741 introduces new binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes support for Vulkan, CUDA 12.4 and 13.3, OpenVINO, SYCL, and ROCm, with updated versions for iOS and Ubuntu.
I wrote a free 15-part series detailing LLM internals, using Gemma 4 12B as the core example. Each part covers technical aspects from tokenization to serving, with real math, tensor shapes, and hardware constraints. The series includes a companion vLLM Deep Dive and is fully accessible without paywalls or email.
The Qwen Code Companion extension for VS Code is now available on the marketplace and has been open-sourced at https://github.com/QwenLM/qwen-code. The user reports that it works well with LM Studio-hosted models and, with minimal configuration, outperforms other local LLM tools such as Continue, Kilo, Cline, and Roo.
A user claims Gemma 4 26b a4b is the best model they've tried for language learning and scientific queries, outperforming Qwen 3.5/3.6 in these domains. The post highlights a gap in available small MoE models between 20b and 30b and suggests a need for more options that target uses other than coding and agentic tasks.
A user has 24B token credits from a Xiaomi token plan competition, worth $50 but obtained for free. They report heavy token consumption during use, limited tool support, and are now concerned about wasting credits due to expiration in four days. The model is praised for its 90% cache hit rate and 99% price reduction on cache hits, with the user noting it performs well in coding and planning tasks.
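The effect of the two cache figures can be sketched with simple arithmetic. This assumes the 90% hit rate and the 99% discount both apply to input tokens only and ignores output-token pricing; the post does not spell this out, and the function name is hypothetical.

```python
def effective_input_cost_fraction(hit_rate: float, hit_discount: float) -> float:
    """Blended input price as a fraction of the uncached price.

    Cache misses pay the full price; cache hits pay (1 - hit_discount).
    """
    return (1 - hit_rate) + hit_rate * (1 - hit_discount)


# Assumed inputs from the post: 90% hit rate, 99% price reduction on hits.
fraction = effective_input_cost_fraction(0.90, 0.99)
print(f"{fraction:.3f} of the uncached input price")
```

Under those assumptions the blended input cost is about 0.109 of the uncached price, so most of the remaining spend comes from the 10% of tokens that miss the cache.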
A patch addresses random failures in the test-args-parser on Windows by modifying argv override to only apply when argc matches, preventing clobbering of programmatic arguments. This fixes a fastfail assertion in the OpenVINO Windows workflow while preserving UTF-8 handling for real binaries.
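The guard described above can be illustrated with a minimal sketch. This is not the patch itself: the function and parameter names are hypothetical, and it only models the idea that the OS-derived (UTF-8) argument list replaces the caller's arguments when the counts agree.

```python
def select_argv(passed_argv: list[str], os_argv: list[str]) -> list[str]:
    """Pick which argument list the parser should use.

    passed_argv: arguments handed to the parser by the caller (may be
                 programmatic, as in a unit test).
    os_argv:     arguments recovered from the OS command line as UTF-8.
    """
    # Override only when the counts match, i.e. the caller most likely
    # forwarded the real process arguments unchanged.
    if len(os_argv) == len(passed_argv):
        return list(os_argv)
    # Otherwise keep the caller's arguments so they are not clobbered.
    return list(passed_argv)


# A test harness passes synthetic arguments; the process itself had none.
print(select_argv(["test", "--ctx-size", "512"], ["test-args-parser.exe"]))
```

A count comparison is only a heuristic: a programmatic argument list that happens to have the same length as the real command line would still be overridden in this sketch.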
A project called Jaz introduces a board where each tile functions as an independent agent responsible for maintaining its own state. The system is open source and available on GitHub, with a live demo at jaz.chat, requiring a coding agent like Claude Code or Codex to operate.
A locally running deep neural network can turn any image into a playable game, using a small Transformer-like model trained from scratch. The model, running on an RTX 5090, generates game sequences autoregressively with real-time keyboard input, though it currently suffers from poor motion and context issues.
A user expresses frustration over Nvidia's pricing after buying two R9700 cards, citing current prices of $7,000 for the RTX 5090 and $13,500 for the RTX 6000 Pro. They question whether the R9700 purchase was a mistake given how sharply prices for newer Nvidia GPUs have risen.
A user asks for advice on using two NVIDIA RTX 3090 GPUs. The post includes an image and links to the original Reddit submission and comments.
Users can now convert and run EXL3 quantized models on Apple Silicon Macs with 64GB+ RAM. Tests show that models like MiniCPM5 and Qwen3.6-27B achieve performance on par with or slightly behind RTX-card-based conversions, with EXL3 offering superior quantization quality compared to MLX.
llama.cpp version b9739 adds support for Windows ARM64 using OpenCL Adreno. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and APIs, including Vulkan, CUDA, OpenVINO, and SYCL.
A cost analysis estimates that hosting diffusiongemma at different user token levels results in monthly costs per user ranging from 1.7€ to 122.8€. The study finds agentic AI usage is economically unsustainable for collective hosting, though costs could decrease with new GPUs or ASICs and a shorter GPU depreciation period.
A prototype demonstrates two Word documents exchanging content using local LLMs, with iterative back-and-forth over multiple turns. Potential practical use cases include a draft document and critic document iterating together, or a specification document and implementation document collaborating, though the viability of such workflows remains uncertain.
llama.cpp version b9738 fixes the CORS proxy to avoid forwarding authentication headers. The release includes binary builds for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.
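The general idea behind such a fix can be sketched as a header filter applied before a proxy forwards a request upstream. This is an illustration only: the set of headers below is an assumption (the release note does not list which headers are dropped), and the names are hypothetical rather than taken from llama.cpp.

```python
# Assumed set of credential-bearing headers; the actual fix may differ.
SENSITIVE_HEADERS = {"authorization", "proxy-authorization", "cookie"}


def filter_forward_headers(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the request headers that is safe to forward.

    Header names are compared case-insensitively, as HTTP requires.
    """
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in SENSITIVE_HEADERS
    }


incoming = {
    "Authorization": "Bearer local-api-key",
    "Accept": "application/json",
    "User-Agent": "example-client",
}
print(filter_forward_headers(incoming))
```

Dropping credentials at the proxy keeps a key meant for the local server from being sent to whatever third-party URL the proxy is asked to fetch.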
A user asks which model would make saner, more autonomous decisions with less need for human guidance: Qwen3.6-27B at BF16 precision or Step3.7 with IQ4_XS quantization. The query compares a dense, high-precision model with a larger, lower-precision MoE model, noting trade-offs in memory and performance.
z.AI, ranked number 2, has publicly praised the number 1 open-source model. The post expresses admiration for the model's performance and its contributions to the community.