All articles — korshunov.ai — ML news

media r/LocalLLaMA · 10d ago

2× Radeon R9700 with Qwen 3.6 27B Q8 MTP on llama.cpp

A user reports running the Qwen 3.6 27B Q8 MTP model on two Radeon R9700 GPUs via llama.cpp with ROCm 7.2.1. Tests show stable decode speeds (40–67 t/s) and prefill throughput of up to 1,500 t/s for prompts under 10k tokens, with MTP draft acceptance rates between 0.33 and 0.61.

media r/LocalLLaMA · 10d ago

Tokenomics Post on LocalLLaMA Reddit

A post titled 'Tokenomics' was submitted by /u/HOLUPREDICTIONS on the LocalLLaMA subreddit. It consists of a diagram of a token distribution and economic model, with links to the image and the comments section.

media r/LocalLLaMA · 10d ago

Can I realistically get close to Claude/Codex capabilities locally?

A user with a 32GB system asks if open-weight models can match Opus 4.8's 1M context and coding performance on local hardware. They note current bottlenecks are context length and privacy concerns, and question whether high-end models like GLM 5.2 or Qwen3.7 are feasible within a $3.5K budget, emphasizing that running 70-80B models offers marginal real-world gains over 27B models with 256K context.

media r/LocalLLaMA · 10d ago

ROCm vs Vulkan vs vLLM Performance on Dual R9700s

Tests show vLLM achieves significantly higher generation speeds on Qwen3.6 models, with 35B-A3B reaching 156 t/s using ROCm and AITER. ROCm outperforms Vulkan on both the 35B and 27B models: ~106 t/s versus ~87 t/s on the 35B, and ~44 t/s versus ~41 t/s on the 27B.

github llama.cpp · 10d ago

llama.cpp release b9747 adds real-time model load tracking and new platform binaries

llama.cpp version b9747 introduces real-time model load progress tracking via SSE endpoints. The release includes binaries for macOS, Linux, Android, Windows, and openEuler, supporting various architectures and acceleration technologies like Vulkan, CUDA, OpenVINO, and SYCL.

media r/LocalLLaMA · 10d ago

Sandboxing code execution for AI agents

A discussion on effective sandboxing methods for AI agents executing arbitrary code, evaluating Docker containers, microVMs, WASM, and host-level execution. The post highlights requirements for isolation, fast startup, network access control, and persistent filesystem support across executions, while asking for shared implementations and accepted tradeoffs.

github llama.cpp · 10d ago

llama.cpp release b9745 adds MTP3 support and cross-platform binaries

llama.cpp version b9745 introduces support for Step3.5/3.7 flash MTP3, including new APIs for layer offset and nextn flags. The release provides prebuilt binaries for macOS, Linux, Android, Windows, and openEuler, with options for CPU, Vulkan, CUDA, OpenVINO, and SYCL acceleration.

media r/LocalLLaMA · 10d ago

Running MiMo-2.5 on Two Halo Strixeses

A user reports running MiMo-2.5 on two 128GB Strix Halo machines (Radeon 8060S graphics), using Proxmox containers and USB4Net for connectivity. The setup reaches 356 t/s prompt processing (pp) and 15 t/s token generation (tg) at 1% or 10k context length, and the user asks whether this counts as merely viable or elite-tier performance. They also note difficulties building vLLM and sglang for consumer hardware, stating that vLLM is unreliable and that sglang is designed for datacenters, not personal systems.

media r/LocalLLaMA · 10d ago

8-16 MI50s Minimax M3 @19 tps TG (peak)

A local LLM setup with 8-16 MI50 GPUs reaches a peak of 19 tokens per second (TPS) in token generation with the Minimax M3 model. Performance is limited by long reasoning outputs and code quality, and speculative decoding shows a 50% acceptance rate with high latency, which points to usability challenges for agentic coding tasks.

media r/LocalLLaMA · 10d ago

Thinking Loop Bug in OpenCode with Local Model

A user reports that OpenCode enters an infinite 'thinking loop' when using local models, prompting itself continuously without ending. The issue occurs across multiple models and configurations, including Qwen and GPT-OSS, and persists in both llama.cpp and LMStudio environments, though the chat window in LMStudio functions normally.

media r/LocalLLaMA · 10d ago

Claude Will Soon Require Identity Verification

Anthropic will soon require users to verify their identity to access Claude. The change is intended to enhance security and ensure responsible use of the platform.

media r/LocalLLaMA · 10d ago

R9700 GPU Performance Issues with vLLM and Multi-GPU Setup

A user reports severe performance issues with their two AMD R9700 GPUs, failing to run vLLM with tensor parallelism (tp=2) due to NCCL errors. Single-card inference shows extremely low throughput—30 tps for Qwen 0.6B and only 5 tps for a 27B INT4 AWQ model—despite proper ROCm installation and system configuration.

media r/LocalLLaMA · 10d ago

Why is AutoRound being slept on so hard?

AutoRound significantly outperforms standard AWQ and RTN in perplexity and accuracy, especially for complex reasoning and long contexts. It natively exports to GGUF, bypassing conversion issues, and runs on any PyTorch setup, yet remains underused despite these advantages.

media r/LocalLLaMA · 10d ago

I mapped every agent config file and tagged real adoption

A guide lists 21 agent configuration conventions across 11 categories, tagged as adopted, emerging, or proposed. The guide includes real examples from public repositories and explicitly notes hype, such as llms.txt being widely published but unconfirmed by major providers.

media r/LocalLLaMA · 10d ago

Proposal for splitting base models to avoid retraining

A proposal suggests splitting model architecture into a stable base model and lightweight, swappable worker models. The base model handles core reasoning and acts as a platform, while worker models provide domain-specific knowledge through runtime hot-plugging, similar to LoRA but for knowledge rather than behavior.

media r/LocalLLaMA · 10d ago

Watch local LLMs escape the rooms you design

A new tool allows users to design escape room-style environments and watch local LLMs navigate and escape using simple actions. The project, built for Hugging Face x Gradio's 'Build Small' hackathon, supports five model presets and enables custom map creation with font-based visuals and JSON import/export. It uses a 'Think then Act' framework to enable small models to perform reliably in structured game environments.

media r/LocalLLaMA · 10d ago

GLM-5.2 Beats Gemini and GPT-5.4 in Coding but Is Inefficient

GLM-5.2 surpasses GPT-5.4 and the entire Gemini lineup in coding performance on the DeepSWE benchmark. However, it requires significantly more output tokens, making it substantially less efficient in terms of cost-per-task compared to models like GPT-5.5 and Claude Opus 4.8.

media r/LocalLLaMA · 10d ago

Gemma 4 QAT responds better to KV cache quantization

A Reddit post reports that Gemma 4 QAT shows significant improvement in performance when using KV cache quantization, as measured on the wikitext dataset with 16k context. The user notes their hardware limits testing 31B models and invites others to explore the results.

media r/LocalLLaMA · 10d ago

Fable vs GLM 5.2 vs KIMI K2.7 (YouTube Video)

A YouTube video compares the performance of Fable, GLM 5.2, and KIMI K2.7. The Reddit post on r/LocalLLaMA links to the video and its comment thread.

media r/LocalLLaMA · 10d ago

Vercel CEO says almost shocked by GLM-5.2's coding abilities

Guillermo Rauch, CEO of Vercel, stated he is 'genuinely impressed, almost shocked' by GLM-5.2's performance in coding tasks. He shared this feedback in a post on X, highlighting the model's strong capabilities in code generation.