All articles — korshunov.ai

Help Running Local Hermes Agent with llama-cpp

A user reports problems running a local Hermes AI agent on a high-end rig with a self-compiled llama-cpp build. The KV cache is reprocessed every 5 messages and reasoning is slow, with the agent repeatedly pausing to report progress instead of continuing autonomously. The user asks whether their llama-cpp parameters are incorrect and which adjustments would improve agent performance and allow sustained reasoning without interruptions.

media r/LocalLLaMA · 12d ago

Maximizing Performance of 2x3090 with NVLink

A user reports achieving only 60 tokens per second (TPS) in short bursts and an average of 40-45 TPS when running Qwen 3.6 27B with Q8_0 quantization on two GeForce RTX 3090 GPUs connected via NVLink. The setup includes Ubuntu 24.04, a Ryzen 9 7950X3D, and 64GB of DDR5, with the display routed through an eGPU.
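For context on the reported figures, here is a rough back-of-envelope sketch of the memory-bandwidth ceiling on decode speed. It assumes a dense 27B-parameter model, Q8_0 at 8.5 bits per weight (32 int8 weights plus one fp16 scale per block), and the RTX 3090's spec-sheet bandwidth of 936 GB/s. It ignores KV cache reads, activations, and inter-GPU communication, so real throughput will land below these bounds. The function names are hypothetical and not taken from the post.

```python
# Back-of-envelope decode ceiling. All constants are assumptions, not
# measurements from the post.
Q8_0_BITS_PER_WEIGHT = 8.5   # 32 int8 weights + one fp16 scale per block
PARAMS = 27e9                # assumed dense 27B-parameter model
BANDWIDTH_GB_S = 936.0       # RTX 3090 spec-sheet memory bandwidth


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return params * bits_per_weight / 8


def decode_ceiling_tps(params: float, bits_per_weight: float,
                       bandwidth_gb_s: float, parallel_gpus: int = 1) -> float:
    """Upper bound on tokens/s if every token must stream all weights once.

    parallel_gpus=1 models a layer split (GPUs work one after another);
    parallel_gpus=2 models an ideal tensor split (GPUs read concurrently).
    """
    bytes_per_token = weight_bytes(params, bits_per_weight)
    return parallel_gpus * bandwidth_gb_s * 1e9 / bytes_per_token


layer_split = decode_ceiling_tps(PARAMS, Q8_0_BITS_PER_WEIGHT, BANDWIDTH_GB_S, 1)
tensor_split = decode_ceiling_tps(PARAMS, Q8_0_BITS_PER_WEIGHT, BANDWIDTH_GB_S, 2)
print(f"layer split ceiling:  {layer_split:.1f} tok/s")
print(f"tensor split ceiling: {tensor_split:.1f} tok/s")
```

Under these assumptions the ceilings come out near 33 and 65 tokens per second, so the reported 40-45 TPS average sits between the two idealized cases.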

github llama.cpp · 12d ago

llama.cpp Release b9729: New Binaries and Platform Support

llama.cpp version b9729 ships with binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes CPU, Vulkan, OpenVINO, SYCL, and ROCm support, along with a new UI package. Internal references to 'webui' have been removed.

media r/LocalLLaMA · 12d ago

SupraLabs Releases SupraVL-Nano-900k Vision-Language Model

SupraLabs has launched SupraVL-Nano-900k, a fully transparent, 900k-parameter vision-language model trained from scratch on Flickr8k. It features a CNN visual encoder, GPT-2-style decoder, and prefix concatenation fusion, with all components openly documented and designed for educational clarity.

media r/LocalLLaMA · 12d ago

How to Set Optimal llama.cpp Parameters for AMD GPU

Users seeking optimal llama.cpp settings for Gemma 4 models on an AMD GPU with 16GB VRAM ask whether trial and error is necessary. They reference Google's default settings for temperature, top-p, and top-k but report inconsistent results, indicating a need for more targeted guidance than the official documentation provides.

media r/LocalLLaMA · 12d ago

Fixing Long-Context Decode Cliff on Radeon R9700 with vLLM 0.22.1

A long-context decode performance cliff on AMD Radeon AI PRO R9700 (RDNA4) was resolved by enabling AITER Unified Attention in vLLM 0.22.1. The fix involves relaxing a CDNA gate to include RDNA4, disabling other attention backends, and using bf16 KV cache, resulting in significant speedups across all context lengths. FP8 KV is ineffective on this hardware, and the model's native 262K context is fully achievable with bf16, offering ~2.9× concurrency without needing FP8.

media r/LocalLLaMA · 12d ago

How to Setup Search with AI Models

A user asks how to give Gemma 4 12B search capabilities in a self-hosted setup. They mention trying Open WebUI, which has issues with search engines like DuckDuckGo (DDG), and seek alternatives that do not require Brave or Google API keys.

github llama.cpp · 12d ago

llama.cpp Release b9728 Adds Comment Line Support and Multiple Platform Binaries

llama.cpp version b9728 introduces support for comment lines in the file passed to --api-key-file. The release includes pre-built binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.
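As an illustration of what comment-line support in a key file typically looks like, here is a minimal Python sketch. It is not llama.cpp's implementation. The summary does not state the comment syntax, so the `#` prefix, the one-key-per-line layout, and the helper name `load_api_keys` are all assumptions made for this example.

```python
def load_api_keys(text: str, comment_prefix: str = "#") -> list[str]:
    """Return the keys from a one-key-per-line file.

    Blank lines and lines starting with comment_prefix are skipped.
    The '#' prefix is an assumption, not confirmed by the release note.
    """
    keys = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(comment_prefix):
            continue  # skip blank lines and comment lines
        keys.append(line)
    return keys


example_file = """\
# keys for the staging server (hypothetical values)
sk-alpha

  # temporarily disabled
sk-beta
"""
print(load_api_keys(example_file))
```

Stripping whitespace before checking the prefix means indented comments are ignored as well, which is a design choice of this sketch rather than documented behavior.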

media r/LocalLLaMA · 12d ago

GLM-5.2-REAP50-GGUF Models Available on Hugging Face

GLM-5.2-REAP50-GGUF models are available on Hugging Face in two quantized versions: Q3_K_M (182 GB) and Q2_K (139 GB). A Reddit post compares the models to Qwen 3.6 27B, though it provides no direct performance evaluation.

media r/LocalLLaMA · 12d ago

Can You Use an SSD to Extend Memory Without Swap on Mac mini M4?

A user asks if an SSD can be used to extend memory for running large AI models on a Mac mini with an M4 chip and 24GB of unified memory. They report that while GPT-120B runs successfully, it consumes 50GB of swap and barely uses their 330GB SSD for KV slots and GGUF files, even though they expected mmap to extend memory onto the SSD.

media r/LocalLLaMA · 12d ago

Commission selects EUROPA consortium as winner of Frontier AI Grand Challenge

The European Commission has chosen the EUROPA consortium, led by Domyn, to develop an open-source frontier AI model in all 24 EU languages. The project, launched in February 2026, aims to create a model with over 400 billion parameters, showcasing Europe's capacity to build advanced AI on its own infrastructure.

media r/LocalLLaMA · 12d ago

Improving local models with an API-based consultant agent

A user asks whether adding a powerful API-based 'consultant' agent, such as GLM 5.2, could enhance local AI workflows by refining plans and learning processes. The post explores the potential benefits of such an agent in improving local model performance through external consultation.

github llama.cpp · 12d ago

llama.cpp release b9726 adds --agent arg and new platform binaries

llama.cpp version b9726 introduces a new --agent argument and removes redundant webui naming compatibility. The release includes precompiled binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options.

github llama.cpp · 12d ago

llama.cpp Release b9727: Update to cpp-httplib 0.48.0

llama.cpp version b9727 updates cpp-httplib to version 0.48.0. The release includes binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

media r/LocalLLaMA · 12d ago

The economics of AI are starting to favor open models

Recent AI model releases show that the high-intelligence, low-cost segment is increasingly dominated by open-weight models like DeepSeek, Qwen, GLM, Kimi, and MiniMax. For most real-world applications, the performance gap between frontier closed models and strong open models is shrinking faster than the cost gap, making open models competitive on both capability and price.

media r/LocalLLaMA · 12d ago

LQ50-24 English Translation Available

A full English translation of LQ50-24 has been shared using Google Translate. The post was submitted by user /u/MundanePercentage674 on Reddit's LocalLLaMA community.

media Don't Worry About the Vase · 12d ago

Claude Fable 5 and Mythos 5: Capabilities

Anthropic launched Claude Fable 5, a Mythos-class model that it claims achieves state-of-the-art performance across software engineering, scientific research, and knowledge work. The U.S. government quickly took the model down after a jailbreak was reported, though Anthropic asserts it is now available again. Fable 5 shows exceptional capabilities and a more nuanced, thoughtful reasoning style than prior models.

media r/LocalLLaMA · 12d ago

Benchmarking or benchmarketing?

LLM benchmarking is increasingly seen as marketing rather than objective measurement. Users ask which benchmarks are genuinely meaningful for local models, as opposed to superficial score-based claims.

github llama.cpp · 12d ago

Docker: Build the UI (#24794)

llama.cpp's Docker build now builds the UI component. The update also reuses the existing APP_VERSION in the container configuration.

media r/LocalLLaMA · 12d ago

Adding a Second GPU to X670E Motherboard for Local LLMs

A user wants to add a second 16GB VRAM GPU (5060 Ti or 5070 Ti) to their MSI X670E Tomahawk WiFi motherboard for running large local LLMs like Qwen 3.6 27B. The current setup lacks space for a second GPU because the primary 5070 Ti occupies the second PCIe slot, leaving only the third slot partially available. The user seeks advice on feasible options, such as using the fourth PCIe slot or a riser, while considering cooling, stability, and physical fit, especially with a horizontal GPU mount like the Lian Li VG4v4.