All articles — korshunov.ai

SupraLabs Releases supra-title-FFT-preview with 115K Samples

SupraLabs has launched supra-title-FFT-preview, a chat title generation model trained on 115K samples from a filtered dataset, up from the 12K samples used for its previous model. The model is a full fine-tune of LiquidAI/LFM2.5-350M-Base in BF16 precision and is built solely for chat title generation. It is available on Hugging Face and can be loaded directly or deployed with vLLM.

media r/LocalLLaMA · 11d ago

RTX 5090 MSI Power Usage and Cable Warning

A user reports that an MSI RTX 5090 draws 475-500 W during inference or diffusion training. They have had no issues with the power cable and stress that it should not be bent, to keep operation safe and stable.

media r/LocalLLaMA · 11d ago

Attention Algebra — a grammar that translates natural language into spectrograms

Attention Algebra is a prototype that translates natural language into algebraic expressions, maps them to mathematical dynamics, and visualizes the result as a spectrogram. It treats language as a lossy projection of high-dimensional states, proposing that raw attention patterns grouped into functions serve as the 'DNA' of text, enabling efficient reasoning chains by reducing token usage from 20k to 4k.

github llama.cpp · 11d ago

llama.cpp Release b9732: New Binaries and Updates

llama.cpp version b9732 ships updated binaries for macOS, Linux, Android, Windows, and openEuler. The release includes refactored child-to-router communication, fixes to wakeup handling, an improved update_status(), and documentation updates. New builds support Vulkan, ROCm, OpenVINO, SYCL, and CUDA 12/13 on multiple architectures.

media r/LocalLLaMA · 11d ago

I benchmarked Claude's 'Fast C++'. It wasn't faster

A user tested Claude's claimed 'Fast C++' implementation and found it did not outperform standard C++ in benchmarks. The post includes a link to a Substack article detailing the testing process and results.

github llama.cpp · 11d ago

ggml-webgpu Adds F16 Adapter Toggles for Vulkan and NVIDIA

The ggml-webgpu project has added adapter toggles for half-precision (F16) support on Vulkan and NVIDIA GPUs. This update enables improved performance on compatible hardware across multiple platforms, including macOS, Linux, Android, Windows, and openEuler, with specific builds available for ARM and x64 architectures.

media r/LocalLLaMA · 11d ago

$1800 GPU cost runs Qwen3.6-27B with 262K context and 55 tok/s

A setup of four RTX 5060 Ti GPUs (totaling $1800) runs Qwen3.6-27B-FP8 at about 55 tokens per second with a 262K context length and a bfloat16 KV cache. The configuration uses P2P and FlashInfer; the benchmark reports an output throughput of 55.67 tokens per second and a 65.25% speculative decoding acceptance rate.

blog Simon Willison · 11d ago

Sean Lynch on MCP's Auth Flow Isolation

Sean Lynch highlights that the Model Context Protocol (MCP) offers a key advantage by isolating authentication flows outside the agent's context window. He suggests the ideal form of MCP could be a simple auth gateway for APIs, which would still represent a significant improvement.

github llama.cpp · 11d ago

llama.cpp Release b9731: Performance Optimization and Cross-Platform Binaries

llama.cpp version b9731 uses std::partial_sort to reduce token sorting overhead, cutting top-n token selection time from 8.555 ms to 0.704 ms. The release includes prebuilt binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options.
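
To illustrate why std::partial_sort helps with top-n selection, here is a minimal sketch. It is not llama.cpp's actual code: the `TokenScore` alias and the `top_n_tokens` helper are hypothetical names introduced for this example. The idea is that only the first n positions need to be ordered, so the rest of the vocabulary is never fully sorted.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical types and helper for illustration only (not from llama.cpp).
// A candidate is a (token id, logit) pair.
using TokenScore = std::pair<int, float>;

// Returns the n highest-scoring tokens in descending order of logit.
// If n exceeds the vocabulary size, all tokens are returned, sorted.
std::vector<TokenScore> top_n_tokens(const std::vector<float>& logits, std::size_t n) {
    std::vector<TokenScore> scored;
    scored.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        scored.emplace_back(static_cast<int>(i), logits[i]);
    }

    n = std::min(n, scored.size());

    // Only the first n elements end up sorted; the remaining elements are
    // left in unspecified order, which avoids sorting the whole vocabulary.
    std::partial_sort(
        scored.begin(),
        scored.begin() + static_cast<std::ptrdiff_t>(n),
        scored.end(),
        [](const TokenScore& a, const TokenScore& b) { return a.second > b.second; });

    scored.resize(n);  // drop the unsorted tail
    return scored;
}
```

Calling `top_n_tokens(logits, 40)` on a vocabulary-sized vector would return the 40 best candidates. std::partial_sort performs roughly N log n comparisons for N candidates, compared with N log N for a full std::sort, so the saving grows as n becomes small relative to the vocabulary.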

github llama.cpp · 11d ago

llama.cpp release b9730: fixes and new binaries

llama.cpp version b9730 includes fixes for UTF-8 handling on Windows and improvements to ggml_fopen and CLI. The release provides binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures and hardware acceleration options, including Vulkan, CUDA, OpenVINO, and SYCL.

media r/LocalLLaMA · 11d ago

Best Local Agents - Jun 2026

A discussion thread identifies the best local AI agents available today, emphasizing open-weight models and local hardware execution. The post defines 'agents' as autonomous software that self-determines actions without pre-programming, distinguishing them from tools like IFTTT or Apple Shortcuts, and sets rules requiring local deployment and open-source agent software as a primary focus.

github Open Interpreter · 11d ago

Rust Release 0.0.12

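The Open Interpreter repository has published a release titled "Rust Release 0.0.12". The listing attributes it to that project, so the version number appears to refer to Open Interpreter's Rust release rather than to the Rust language itself. No release notes are summarized.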

github Open Interpreter · 11d ago

Rust Release 0.0.13

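The Open Interpreter repository has published a release titled "Rust Release 0.0.13". The listing attributes it to that project, so the version number appears to refer to Open Interpreter's Rust release rather than to the Rust language itself. No release notes are summarized.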

github Open Interpreter · 11d ago

Rust Release 0.0.14

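The Open Interpreter repository has published a release titled "Rust Release 0.0.14". The listing attributes it to that project, so the version number appears to refer to Open Interpreter's Rust release rather than to the Rust language itself. No release notes are summarized.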

media r/LocalLLaMA · 11d ago

Help Running Local Hermes Agent with llama-cpp

A user reports issues running a local Hermes AI agent on a high-end rig using self-compiled llama-cpp. The KV cache is reprocessed about every 5 messages, reasoning is slow, and the agent repeatedly pauses to report progress instead of continuing autonomously. The user asks whether their llama-cpp parameters are wrong and which adjustments would improve agent performance and allow sustained reasoning without interruptions.

media r/LocalLLaMA · 11d ago

Maximizing Performance of 2x3090 with NVLink

A user reports reaching only 60 tokens per second in short bursts, and an average of 40-45 TPS, when running Qwen 3.6 27B with Q8_0 quantization on two GeForce RTX 3090 GPUs connected via NVLink. The setup includes Ubuntu 24.04, a Ryzen 9 7950X3D, and 64GB of DDR5, with the display routed through an eGPU.

github llama.cpp · 11d ago

llama.cpp Release b9729: New Binaries and Platform Support

llama.cpp version b9729 ships binaries for macOS, Linux, Android, Windows, and openEuler across multiple architectures. The release includes CPU, Vulkan, OpenVINO, SYCL, and ROCm support, along with a new UI package. Internal references to 'webui' have been removed.

media r/LocalLLaMA · 11d ago

SupraLabs Releases SupraVL-Nano-900k Vision-Language Model

SupraLabs has launched SupraVL-Nano-900k, a fully transparent, 900k-parameter vision-language model trained from scratch on Flickr8k. It features a CNN visual encoder, GPT-2-style decoder, and prefix concatenation fusion, with all components openly documented and designed for educational clarity.

media r/LocalLLaMA · 11d ago

How to Set Optimal llama.cpp Parameters for AMD GPU

Users looking for optimal llama.cpp settings for Gemma 4 models on an AMD GPU with 16GB of VRAM ask whether trial and error is necessary. They cite Google's default settings for temperature, top-p, and top-k but report inconsistent results, and want more targeted guidance than the official documentation provides.

media r/LocalLLaMA · 11d ago

Fixing Long-Context Decode Cliff on Radeon R9700 with vLLM 0.22.1

A long-context decode performance cliff on AMD Radeon AI PRO R9700 (RDNA4) was resolved by enabling AITER Unified Attention in vLLM 0.22.1. The fix involves relaxing a CDNA gate to include RDNA4, disabling other attention backends, and using bf16 KV cache, resulting in significant speedups across all context lengths. FP8 KV is ineffective on this hardware, and the model's native 262K context is fully achievable with bf16, offering ~2.9× concurrency without needing FP8.