All articles — korshunov.ai


Offline GPU Build Picker Estimates Local Model Fit and Speed

A developer has released an offline, single-file HTML tool that estimates which local large language models will fit on a specific GPU configuration and predicts their token generation speed. The tool is designed to answer the common question of whether a custom PC build can run desired models effectively, without requiring a backend or user account.
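As a rough illustration of the kind of estimate such a tool might make, here is a minimal back-of-envelope sketch. It is not the tool's actual method, which the summary does not describe. It assumes that memory use is weights plus KV cache plus a fixed overhead, and that decode speed is limited by memory bandwidth. The `ModelSpec` fields, the overhead, the efficiency factor, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes per GiB


@dataclass
class ModelSpec:
    """Hypothetical description of a dense transformer (illustrative fields)."""
    params_billion: float   # total parameters, in billions
    bits_per_weight: float  # average bits per weight after quantization
    n_layers: int
    n_kv_heads: int
    head_dim: int


def weight_bytes(m: ModelSpec) -> float:
    """Approximate size of the quantized weights in bytes."""
    return m.params_billion * 1e9 * m.bits_per_weight / 8


def kv_cache_bytes(m: ModelSpec, context_tokens: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: one K and one V vector per layer, per KV head, per token."""
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * context_tokens * bytes_per_elem


def fits(m: ModelSpec, vram_gib: float, context_tokens: int, overhead_gib: float = 1.0) -> bool:
    """True if weights + KV cache + an assumed fixed overhead fit in VRAM."""
    needed = weight_bytes(m) + kv_cache_bytes(m, context_tokens) + overhead_gib * GIB
    return needed <= vram_gib * GIB


def decode_tokens_per_second(m: ModelSpec, bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Bandwidth-bound estimate: each generated token reads all weights once.

    `efficiency` is an assumed fudge factor, not a measured value.
    """
    return efficiency * bandwidth_gb_s * 1e9 / weight_bytes(m)


# Made-up 8B-parameter model at about 4.5 bits per weight.
example = ModelSpec(params_billion=8, bits_per_weight=4.5,
                    n_layers=32, n_kv_heads=8, head_dim=128)
print(fits(example, vram_gib=8, context_tokens=8192))
print(round(decode_tokens_per_second(example, bandwidth_gb_s=448), 1))
```

Real runtimes add activation buffers, and behave differently for mixture-of-experts models or partial CPU offload, so treat numbers like these as a first-pass filter only.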

media r/LocalLLaMA · 3h ago

Reddit user asks for updates on agent browser use frameworks and local model capabilities

A Reddit user asks about the current state of agent browser use frameworks, specifically whether they now handle long workflows better than in the user's previous experience.

media r/LocalLLaMA · 3h ago

User seeks advice for running local LLMs on low-spec hardware

A Reddit user is asking for recommendations to run small local language models and potentially agentic tasks like Hermes on an old MacBook Pro with limited resources.

media r/LocalLLaMA · 3h ago

SpectralQuant Qwen3.5 0.8B Q4_K_M recovers 96.5% of BF16 gap

Spectral Labs has published a release candidate of a calibration-aware Q4_K_M quantization of the Qwen3.5 0.8B model, built with a new method called SpectralQuant. The approach aims to make standard Q4_K_M footprints behave more like larger quant formats while remaining compatible with llama.cpp.

media Ahead of AI · 4h ago

Setting Up a Local Coding Agent with Open-Source Tools

This article provides a tutorial on configuring a production-ready, fully local coding agent stack using open-source tools and open-weight large language models. It details how to combine a locally served LLM with a coding harness capable of reading files, making edits, running commands, and verifying changes.

media r/LocalLLaMA · 4h ago

Orthrus diffusion head trained Qwen 3.5/3.6 and Gemma 4 models dropping soon

The Orthrus project is preparing to release support for Qwen 3.5, Qwen 3.6, and Gemma 4 models using a diffusion head approach. The team has finalized testing and is currently setting up the release pipeline.

media r/LocalLLaMA · 4h ago

Reddit user spots new vision mode in DeepSeek app

A Reddit user observed a new vision mode within the DeepSeek application, prompting speculation about an upcoming vision model release. The user clarified that the feature is not an OCR tool, as it successfully described images containing no text.

media r/LocalLLaMA · 4h ago

Reports of 96GB VRAM RTX 5090s from Shenzhen's Huaqiangbei

Visitors to Shenzhen's Huaqiangbei electronics market have encountered reports of, and possible offers for, modified Nvidia RTX 5090 graphics cards equipped with 96 GB of VRAM. One seller indicated that such a hacked-up Blackwell RTX 6000 would cost approximately $8,200: 36,000 yuan for the base card plus 20,000 yuan for the memory upgrade.

media r/LocalLLaMA · 4h ago

User asks for better coding models for single DGX Spark

A Reddit user with a single DGX Spark featuring 128 GB of unified memory is seeking recommendations for improved coding models, currently using StepFun step-3.7-flash and Qwen 3.6 variants.

media r/LocalLLaMA · 4h ago

Reddit Discussion on Qwen Finetune Performance

A Reddit user observes that while finetuning Qwen models is a popular practice, there is a notable lack of positive feedback regarding their performance. The user questions whether any Qwen finetunes have genuinely surpassed the base model capabilities.

media r/LocalLLaMA · 4h ago

DeepSeek-V4-Pro-DSpark model and paper released

DeepSeek has released the DeepSeek-V4-Pro-DSpark model on Hugging Face, along with its associated technical paper.

media r/LocalLLaMA · 4h ago

Fine-tuned LiquidAI’s LFM2.5-230M on Fable-5 coding traces

A user has fine-tuned LiquidAI’s LFM2.5-230M model on Fable-5 coding traces and released it as a GGUF file for local use.

media r/LocalLLaMA · 4h ago

llama.cpp PR #20793: Reintroducing less synchronizations during split compute

Pull request #20793 reintroduces reduced synchronization during split compute operations in llama.cpp, primarily targeting CUDA performance improvements. The changes involve exchanging synchronous copies for async copies and relaxing sync requirements between input copies on supported backends.

github llama.cpp · 4h ago

llama.cpp b9828 release: OpenCL Flash Attention improvements and new binaries

The llama.cpp b9828 release introduces significant OpenCL enhancements, specifically reworking the Flash Attention kernels for f16 and f32 precision. This update includes new prefill prepass kernels and support for q4_0 and q8_0 quantization formats.

media r/LocalLLaMA · 5h ago

User asks when DeepSeek V4 Flash and MiniMax M3 support will be merged into llama.cpp

A Reddit user is asking for an estimated timeline for the official merge of DeepSeek V4 Flash and MiniMax M3 model support into the main llama.cpp repository.

media r/LocalLLaMA · 5h ago

STT That Can Challenge Dragon Professional on Windows

A Reddit user seeks local LLM-based speech-to-text solutions for Windows that can rival Dragon Professional, specifically regarding the ability to edit pasted text and load words during recording.

media r/LocalLLaMA · 5h ago

Ornith-1.0-35B Q3_K_M: ~17 GB VRAM, KLD-checked against BF16

The author quantized the deepreinforce-ai/Ornith-1.0-35B model to Q3_K_M format, bringing its footprint to approximately 17 GB of VRAM, and used KL divergence checks against the BF16 model to verify that its behavior is preserved.
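For readers unfamiliar with KL divergence checks, the sketch below shows one common way to compare a quantized model with its reference: compute KL(reference || quantized) over the next-token distributions at each position and average the result. It is an illustrative assumption about what such a check looks like, not the author's procedure. The function names and toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kld(ref_batch, test_batch):
    """Average KL divergence over matching token positions."""
    klds = [kl_divergence(r, t) for r, t in zip(ref_batch, test_batch)]
    return sum(klds) / len(klds)


# Toy logits standing in for BF16 (reference) and quantized model outputs.
reference = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quantized = [[1.9, 1.1, 0.1], [0.4, 0.6, 2.9]]
print(mean_kld(reference, quantized))
```

A mean close to zero means the quantized model assigns nearly the same probabilities as the reference on the evaluated text. What counts as close enough is a judgment call that depends on the use case.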

media r/LocalLLaMA · 5h ago

ContextForge: local SDK for long term memory that actually holds up over long runs

ContextForge is a new SDK designed to provide effectively unbounded context for LLMs without overwhelming the prompt window. It addresses the common issue of long-term memory systems failing during extended runs by treating the context window as a dynamic working set rather than permanent storage.

media r/LocalLLaMA · 5h ago

Troubleshooting P2P on 4x5060 Ti Bifurcation

A cloud systems engineer reports that connecting four GPUs through a single PCIe x16 card with 4x4 bifurcation creates a bandwidth choke point for peer-to-peer (P2P) communication. The bottleneck saturates the fabric connecting the cards, so performance is worse than with P2P disabled.

media r/LocalLLaMA · 5h ago

User asks about distilling models for agentic theorem proving

A user on r/LocalLLaMA is considering self-hosting models for agentic theorem proving to reduce costs, as they have hardware funding but no LLM credits. They propose distilling capabilities from a larger model into a smaller one suitable for niche use cases like Rocq, noting a lack of existing models for this specific language.