All articles — korshunov.ai

No European inference providers for GLM 5.2 or DeepSeek V4 Flash

A Reddit user notes that OpenRouter lists 16 providers for GLM 5.2, all based in the US, Singapore, or China. The user asks why no European providers appear to be running Chinese open-weight models such as GLM 5.2 or DeepSeek V4 Flash.

media r/LocalLLaMA · 9d ago

QAT KV Cache Quantization for Gemma 4 31B Shows Massive Improvement

QAT KV cache quantization for Gemma 4 31B significantly reduces KL divergence compared to standard quants. QAT q8_0 achieves a worst-case divergence of 1.5, outperforming standard q4_0 by a factor of about 38, and QAT q4_0 surpasses standard q8_0 in performance, with much lower output drift and no catastrophic outliers.
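As a rough illustration of the metric in this summary, the sketch below computes the KL divergence (in nats) between a reference next-token distribution and one produced with a quantized KV cache, then takes the worst case over token positions. It assumes the benchmark compares per-token logits in this way, which the summary does not spell out; the function names and the toy logits are hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one row of logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

def kl_divergence(ref_logits, test_logits):
    # KL(P || Q) where P comes from the reference run and Q from the quantized run.
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def worst_case_kl(ref_rows, test_rows):
    # Largest per-position divergence across a sequence (one row per token position).
    return max(kl_divergence(r, t) for r, t in zip(ref_rows, test_rows))

# Toy data (hypothetical): two token positions, three-word vocabulary.
reference = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quantized = [[1.9, 1.1, 0.1], [0.5, 2.5, 1.0]]
print(worst_case_kl(reference, quantized))
```

A worst-case figure like this is sensitive to single outlier positions, which is presumably why the post reports it alongside overall drift.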

media r/LocalLLaMA · 9d ago

Ling and Ring 2.6 Technical Report Releases Trillion-Parameter Models

The Ling and Ring 2.6 release includes base models for Ling-2.6-1T and Ling-2.6-flash, both available on Hugging Face. The Ling-2.6-flash model (100B parameters) is described as enabling fast inference for users with 24-32GB VRAM and high throughput for CPU-only inference with 32GB RAM.

media r/LocalLLaMA · 9d ago

Can Jetson Orin Nano Run a Coding Model Like Qwen?

A user asks whether a Jetson Orin Nano can run a coding model such as Qwen. They are considering Qwen 35B with MoE 3B but note it may be too large for the device.

media r/LocalLLaMA · 9d ago

Gemma 4 QAT 31B responds better to KV cache quantization

A benchmark shows that Gemma 4 QAT 31B performs better with KV cache quantization compared to previous versions. The results were derived from a post on the LocalLLaMA subreddit, where user justicecurcian shared performance data.

github llama.cpp · 9d ago

Fix for edit_file crash on append at file end

A crash in file editing when appending at the end of a file was fixed by normalizing -1 to n (insert at end) instead of n+1. The patch restricts -1 to append mode and rejects it for replace/delete operations to prevent silent overwriting of the last line, and ensures insert offset is computed as an integer to avoid heap-buffer-overflow.
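The indexing rule described in this fix can be sketched as follows. This is a minimal Python illustration, not the patched code: `append_lines` and `replace_line` are hypothetical helpers, and the convention that an append position counts the lines preceding the insert is an assumption.

```python
def append_lines(lines, after, new_lines):
    """Insert new_lines after the given line count; -1 means 'at end of file'."""
    n = len(lines)
    if after == -1:
        after = n  # insert offset n is the end; n + 1 would be out of range
    if not isinstance(after, int) or not 0 <= after <= n:
        raise IndexError(f"append position {after!r} outside 0..{n}")
    return lines[:after] + list(new_lines) + lines[after:]

def replace_line(lines, line, text):
    """Replace a 1-based line; -1 is rejected so the last line is never silently overwritten."""
    if line == -1:
        raise ValueError("-1 is only valid in append mode")
    if not isinstance(line, int) or not 1 <= line <= len(lines):
        raise IndexError(f"line {line!r} outside 1..{len(lines)}")
    result = list(lines)
    result[line - 1] = text
    return result

print(append_lines(["a", "b"], -1, ["c"]))
```

Keeping the sentinel confined to the append path means every other operation only ever sees an explicit, range-checked line number.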

media r/LocalLLaMA · 9d ago

Support for Step3.5/3.7 Flash MTP3 Added

A pull request adds support for Step3.5 and Step3.7 Flash MTP3 in llama.cpp. This enhancement enables improved performance with specific models by leveraging multi-layer MTP3 operations. The update is available in the latest version of llama.cpp and follows up on PR #23274.

media r/LocalLLaMA · 9d ago

Gemma 4 31B Q6 Runs at 8-9 t/s on Dual 9060 XT Cards

A user reports running Gemma 4 31B Q6 on two AMD Radeon RX 9060 XT 16GB cards, achieving consistent throughput of 8-9 tokens per second. They note the performance is usable but below expectations and wonder whether further optimization is possible or the hardware is the limit.

media r/LocalLLaMA · 9d ago

Will dedicated hardware for local LLMs become affordable soon?

Users ask if dedicated hardware for running local large language models will become affordable for consumers soon. They note that while models like Qwen 27B are effective, hardware costs remain high, and wonder if Chinese manufacturers—despite challenges in chip fabrication and software—could deliver low-cost, scalable solutions.

media MarkTechPost · 10d ago

The 7 Types of Agent Memory: A Technical Guide

Large language models are stateless by default, requiring memory mechanisms to retain context across interactions. The seven types of agent memory—working, semantic, episodic, procedural, retrieval, parametric, and prospective—categorize memory by form and duration, enabling agents to plan, learn, and act over time. Each type serves distinct use cases, from storing user preferences to scheduling future goals, and together they form a comprehensive system for long-horizon, context-aware AI agents.

media MarkTechPost · 10d ago

Tutorial on Building Python-First Interactive Dashboards with Prefab

This tutorial demonstrates how to create interactive dashboards in Python using Prefab's component-based UI framework. It generates synthetic pipeline data, integrates reactive controls like charts, forms, and tabs, and exports the app as a static HTML file for direct preview in Google Colab.

media Hugging Face Forums · 10d ago

Capability Is Not in the Weights: Empirical Negative Result on MLP Weight Projection

An empirical study found that projecting MLP weights from one transformer model into another fails to transfer semantic capability. Every tested variant performed worse than the unmodified host model, indicating a structural limitation in weight projection. The results challenge public claims about model capabilities based on benchmarks, showing such claims do not reflect actual internal weight geometry.

media Hugging Face Forums · 10d ago

The Clockwork Dark: A Local-First AI Narrative-RPG Engine

The Clockwork Dark is a local-first, AI-driven narrative-RPG engine that uses a deterministic state machine to resolve all game mechanics. It features two autonomous LLMs that narrate the story, with one acting as a patient world voice and the other as an unreliable, godlike assistant. The game offers players a choice: fight the encroaching supernatural corruption or embrace a quiet life in a bakery, with both paths considered valid endings.

media Hugging Face Forums · 10d ago

Infinitely stuck on 'starting' with Docker container running

A user reports that their Docker container running R/Shiny on rocker/r2u builds successfully and logs 'Listening on http://0.0.0.0:7860', yet the Space remains in the 'starting' state and is inaccessible. The issue persists despite no code errors, and the user asks for wider attention, noting it may be a platform-side problem with Hugging Face.

media Hugging Face Forums · 10d ago

NOVA-VAD beats Silero, Pyannote, and WebRTC on noisy audio with 93% accuracy

NOVA-VAD, a lightweight and explainable Voice Activity Detector, achieves 93% accuracy on noisy audio from the UrbanSound8K dataset, outperforming WebRTC (58%), Pyannote (62%), and Silero (87%). It uses only scikit-learn, requires no GPU, and provides feature importance and confidence scores in plain English.

media Hugging Face Forums · 10d ago

Small-scale debug comparison of OLMo-core with Engram graft

A 200-step training comparison between a base OLMo3 600M model and a version with a DeepSeek-style Engram graft shows lower training and evaluation loss, faster grad-norm stabilization, and improved early learning behavior. The Engram graft, injected into layers 1 and 5, increases trainable parameters to ~1.7B but maintains only a 40k increase in active parameters per token, indicating efficient memory usage.

media Hugging Face Forums · 10d ago

LLMs as Epistemic Accelerators: The Risk Is Not Only Hallucination

LLMs do not merely hallucinate; they amplify human epistemic overconfidence by turning weak hypotheses into coherent, polished claims before evidence is verified. This creates a risk of premature certainty in research, policy, and other domains, not because models lie, but because they accelerate human tendencies to favor elegant explanations over uncertainty.

media Hugging Face Forums · 10d ago

Tenstorrent AI Accelerator Cards Available

Tenstorrent has released Wormhole and Blackhole AI accelerator cards. The hardware section lists these cards, with discussions on which models are likely compatible.

media Hugging Face Forums · 10d ago

Space stuck 'Restarting' on old commit for 16+ hours

A Hugging Face Space has been stuck showing 'Restarting' on commit 8240352 for over 16 hours, despite multiple newer commits building successfully. The container starts healthily in logs, but traffic never switches to the new version, and recovery actions like factory rebuild or restart have no effect.

blog Eugene Yan · 10d ago

Patterns for Building Cybersecurity Evals

A framework for cybersecurity evaluation includes a sandboxed target, varied inputs affecting task difficulty, available tools, and a grader to assess outcomes.
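The four parts named in this summary could be modeled roughly as below. This is a hypothetical sketch rather than the structure used in the post: the `EvalCase` fields, the `run_case` helper, and the toy flag check are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    target: str                    # identifier of the sandboxed target (hypothetical)
    inputs: dict[str, str]         # inputs whose variation changes task difficulty
    tools: list[str]               # names of the tools made available to the agent
    grader: Callable[[str], bool]  # judges the agent's final output

def run_case(case: EvalCase, agent: Callable[[EvalCase], str]) -> bool:
    # The agent sees the case definition; only the grader decides pass or fail.
    return case.grader(agent(case))

# Toy usage: the grader checks for a planted flag string.
case = EvalCase(
    target="sandbox-web-01",
    inputs={"hint_level": "none"},
    tools=["shell", "http_client"],
    grader=lambda output: "FLAG{demo}" in output,
)
print(run_case(case, lambda c: "found FLAG{demo} in /tmp"))
```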