All articles — korshunov.ai

User asks about distilling models for agentic theorem proving

A user on r/LocalLLaMA is considering self-hosting models for agentic theorem proving to reduce costs, as they have hardware funding but no LLM credits. They propose distilling capabilities from a larger model into a smaller one suitable for niche use cases like Rocq, noting a lack of existing models for this specific language.

blog Simon Willison · 4h ago

Dean W. Ball on AI Industry Dynamics and Global Markets

Dean W. Ball argues that the high cost of training frontier models is recouped only during a narrow post-release window, before competition compresses margins.

media r/LocalLLaMA · 4h ago

User purchases used Minisforum MS-S1 Max for local LLM workloads

A user shares their decision to buy a lightly used Minisforum MS-S1 Max with 128GB of memory for approximately US$2,800, citing the rising costs of Apple hardware and closed-model services as the primary motivators. The author compares this purchase favorably against the new Geekom A9 Mega, highlighting the MS-S1's specific advantages: 10GbE networking, 80Gbps USB4 v2, a PCIe slot, and an internal power supply.

media r/LocalLLaMA · 4h ago

Kokoro Enhancements Ported for Web and Python Projects

The author has released web-based and Python versions of enhancements to Kokoro's voice controls, designed to be easily ported into other projects. Both implementations are fully client-side, with the web version achieving approximately 40ms per generation when hardware acceleration is enabled via WebGPU.

media r/LocalLLaMA · 4h ago

Nemotron-3-Super-120B-A12B achieves perfect needle retrieval to 504K tokens on 4×3090

A user tested NVIDIA's Nemotron-3-Super-120B-A12B, a hybrid Mamba and MoE model, and achieved exact recall in needle-in-a-haystack tests up to 504,482 tokens. The model ran fully on GPU across four RTX 3090s using the i1-Q4_K_S quantization; its Mamba layers maintain a constant-size recurrent state rather than a growing KV cache.

media r/LocalLLaMA · 5h ago

Testing Qwen3.6-35B-A3B on RTX 3060 for Receipt-to-JSON Extraction

A user replaced Google Vision in a receipt processing pipeline with the local Qwen3.6-35B-A3B model running on an RTX 3060 GPU. The experiment demonstrated that the local setup could successfully parse key fields from Japanese receipts into JSON format.

blog Simon Willison · 5h ago

Timothy B. Lee on LLMs and Learning Curves

Timothy B. Lee critiques the notion that using large language models requires no skill or learning curve.

media r/LocalLLaMA · 5h ago

Config for daily beta llama.cpp vulkan on 7900xtx/ubuntu

A user shares a bash configuration script for running the Qwen3.6-35B-A3B IQ4_XS model using the Vulkan backend in llama.cpp on an AMD 7900 XTX GPU with Ubuntu.

media r/LocalLLaMA · 5h ago

Upgraded my budget build to multi-GPU for inference

A user upgraded a budget PC with two RTX 3090s and an Intel Arc A770 to test multi-GPU inference performance using llama.cpp. The primary finding is that the Vulkan backend causes excessive memory overhead compared to CUDA, making it unsuitable for mixed-vendor setups.

media r/LocalLLaMA · 5h ago

vulkan: make TP viable by pwilkin · Pull Request #25051

Pull Request #25051 to the ggml-org/llama.cpp repository, submitted by pwilkin (identified as Piotr), aims to make tensor parallelism (TP) viable on the Vulkan backend.

media r/LocalLLaMA · 5h ago

Developer builds local-first LLM harness and seeks community feedback

A developer with 45 years of software experience is completing a local-first harness for running local and API models, featuring logic around multiple agents. The author has spent six months building tools to improve the local LLM workflow and is now asking the community what features would enhance their experience.

media r/LocalLLaMA · 5h ago

Why do people keep investing in Intel for AI?

The post questions why Wall Street classifies Intel as an "AI picks and shovels" investment and asks who is actually buying Intel hardware for AI data centers.

media r/LocalLLaMA · 6h ago

Reddit user seeks advice on multi-model backends and config swapping

A Reddit user is planning to deploy a machine with multiple GPUs for serving coding and Hermes models, seeking solutions that allow flexible configuration swapping without manual intervention.

media r/LocalLLaMA · 6h ago

Consider post-training instead of benchmarking for new hardware

The author argues that newly acquired hardware should be used for supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) rather than standard model benchmarking. Post-training open models offers a viable path to monetization, especially as proprietary APIs become less accessible or more expensive.

blog Simon Willison · 6h ago

2,000 people tried to hack my AI assistant

Fernando Irarrázaval ran a challenge on hackmyclaw.com to test whether 6,000 attempts could leak secrets from his OpenClaw instance, which uses the Opus 4.6 model.

blog Simon Willison · 6h ago

Spectacular hypothetical incident report by Andrew Nesbitt

Andrew Nesbitt published a speculative incident report detailing a scenario where two AI review agents from competing vendors enter a disagreement loop over the safety of the 'foxhole-lz4' package.

media r/LocalLLaMA · 6h ago

Streaming medical STT running locally on a MacBook

A developer has created a streaming medical speech-to-text model that operates fully on-device, demonstrated via MLX on a MacBook. The project is currently undergoing further evaluations, with open weights planned for release next week.

media r/LocalLLaMA · 6h ago

Book Review: Domain-Specific Small Language Models by Guglielmo Iozzia

This review evaluates Guglielmo Iozzia's book "Domain-Specific Small Language Models," which advocates a shift from generalist large language models to specialized, fine-tuned small language models (SLMs). The reviewer argues that SLMs offer better control, visibility, and cost-efficiency for narrow tasks, and contrasts this with the hype surrounding artificial general intelligence.

media r/LocalLLaMA · 6h ago

Distill-on-idle pipeline for on-device memory assistant using 4B models

The article details an engineering approach to building a local AI assistant that converts raw screen captures and meeting transcripts into queryable data using only models that run efficiently on laptops. The system leverages Apple's Vision framework for OCR, idle-time distillation of a 4B Gemma model, and hybrid retrieval to avoid performance bottlenecks.

blog Simon Willison · 6h ago

OpenAI previews GPT-5.6 series with Sol, Terra, and Luna models

OpenAI has initiated a limited preview of the GPT-5.6 model series, introducing three distinct variants: Sol as the flagship, Terra for balanced everyday work, and Luna for fast, affordable tasks. The company plans to make these models generally available in the coming weeks following this initial phase with trusted partners.