All articles — korshunov.ai


Open-sourcing a harness for evaluating VLMs on your own video with traced runs

The authors have open-sourced a harness for evaluating Vision-Language Models (VLMs) that lets users test models on their own video data, with full reproducibility through traced runs. The tool ties every result to its specific input and configuration, which supports evaluating accuracy, latency, and cost.

media r/LocalLLaMA · 4h ago

Reddit Discussion: Local AI Workflows

A Reddit post in the r/LocalLLaMA community asks users to share local AI workflows that significantly improved their productivity or utility. The author specifically invites suggestions regarding RAG, MCP, coding agents, prompt organization, document indexing, and automation.

media r/LocalLLaMA · 4h ago

User asks whether to buy one RTX Pro 6000 or two DGX Sparks for local AI development

A Reddit user is seeking hardware recommendations for running multiple small to medium-sized models locally for data parsing, extraction, and reasoning tasks. The user intends to use the setup for model building, testing, LoRA creation, and distillation, while reserving large cloud models like Opus for complex tasks.

media r/LocalLLaMA · 4h ago

Gemma 4 12b needs glasses

A user reports frustration with Gemma 4's default image resolution settings, noting that the model struggles to decipher smaller text and larger compositional elements compared to competitors like Qwen 3.6.

media r/LocalLLaMA · 4h ago

Planning small AI RIG, 5 X 5060ti 16GB, after selling my 5090

A user on Reddit is asking for feedback on a plan to sell their Zotac Solid RTX 5090 with 128GB of RAM and replace it with five RTX 5060 Ti 16GB cards.

media r/LocalLLaMA · 4h ago

vibe shift: I can see this coming...

The post consists only of a title and metadata, with no body text to summarize.

media r/LocalLLaMA · 4h ago

Reddit user proposes combining RTX 5080 and 4060 for local LLM inference

A Reddit user in the r/LocalLLaMA community is considering upgrading their hardware to improve inference speed and capacity for Qwen models by pairing a future RTX 5080 with their existing RTX 4060. The user aims to achieve at least 20-40 tokens per second while running Qwen 27B models, using the combined 24GB of VRAM through tensor or layer splitting in llama.cpp or vLLM. They are evaluating this asymmetric dual-GPU setup against other options like the AMD R9700 AI Pro or 7900XTX, citing benchmark data that suggests limited performance gains for the AMD cards relative to their cost.

media r/LocalLLaMA · 4h ago

Interactive Explainer for Speculative Decoding and MTP

A user has published an interactive explainer on the topic of speculative decoding and Multi-Token Prediction (MTP). The resource is available via a link provided in the original submission.

media r/LocalLLaMA · 4h ago

Optimizing llama.cpp + Qwen 27B on RTX PRO 6000 Blackwell for coding agents

A user reports running Qwen3.6 27B MTP with llama.cpp on an RTX PRO 6000 Blackwell workstation to reduce reliance on Claude, noting the model is comparable to Sonnet but suffers from stability issues during coding sessions.

media r/LocalLLaMA · 4h ago

Reddit User Asks for Experiences with Ornith-1.0 9B Model

A Reddit user is inquiring whether others have tested the Ornith-1.0 9B model. The user specifically asks if they should consider using it instead of Qwen2.5-9B variants.

media r/LocalLLaMA · 4h ago

KLD is flawed in abliteration

A Reddit user argues that Kullback-Leibler divergence (KLD) is a flawed metric for measuring the difference between an abliterated model and its base version. The author notes that KLD can be computed in many ways, depends entirely on the evaluation prompts, and is often manipulated via first-token KLD to make models appear superior.

media r/LocalLLaMA · 4h ago

Does llama cpp split mode tensor cause issues?

A user reports that using tensor split mode in llama.cpp causes looping issues with tool calls and reasoning traces when running Qwen 27B and Gemma 4 26B (MoE) models across an RTX 5080 and two RTX 5060 Ti GPUs.

media r/LocalLLaMA · 4h ago

How long does your prompt processing actually take when resuming a long session?

A Reddit user is asking the community for data on how long it takes to resume coding agent sessions with long contexts of 100k tokens or more. The inquiry specifically targets users running these agents locally.

media r/LocalLLaMA · 4h ago

Impact of PCIe 5.0 x8/x4 vs x8/x8 on Dual GPU Inference

A user asks whether running dual GPUs in a PCIe 5.0 x8/x4 configuration instead of x8/x8 causes significant performance hits for LLM inference.

arxiv arXiv cs.CL · 4h ago

Compositionality and the lexicon in evolutionary semantics

This article introduces an evolutionary modeling framework that integrates formal semantics by allowing lexical meanings and composition functions to co-evolve under pressures for conceptual simplicity and communicative accuracy.

arxiv arXiv cs.CL · 4h ago

Bridging Talk and Thought: Understanding Dialogue Dynamics Across Collaborative Problem-Solving Contexts

This article presents a conceptual framework for analyzing dialogue dynamics in collaborative problem-solving contexts, with a specific focus on human-AI and multi-agent interactions. The authors argue that understanding these dialogic interactions is crucial for optimizing partnerships as intelligent systems gain autonomous reasoning capabilities.

arxiv arXiv cs.CL · 4h ago

LMs as Task-Specific Knowledge Bases: An Interpretability Analysis

This study investigates whether language models function as consistent knowledge bases by analyzing if facts acquired during one task remain accessible in others. The research reveals that LMs encode knowledge in a task-specific manner, with distinct parameter subsets underlying different tasks for the same fact.

arxiv arXiv cs.CL · 5h ago

CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention

The CARVE architecture addresses three critical defects in the leading GDN-2 delta-rule recurrent model by restricting erase operations to the key axis, thereby enabling valid WY-form triangular chunk solving and improving value efficiency. By reusing the recurrent output tensor as a content signal and replacing per-value write-gate projections with single scalars, CARVE maintains bit-identical initialization to GDN-2 while resolving memory-blind gating issues.

arxiv arXiv cs.CL · 5h ago

The Geometry of Updates: Fisher Alignment at Vocabulary Scale

This article addresses the challenge of training-free source selection for large language models with shared vocabularies in scientific domains like SMILES and genomics, where classical metrics are either uninformative or computationally prohibitive. The authors demonstrate that representation similarity metrics are non-identifiable for transfer because models can share identical representations yet have orthogonal head updates.

arxiv arXiv cs.CL · 5h ago

How Surprising Is Historical Italian to Language Models? Tokenization Tax, Comprehension Tax, and a Simple Mitigation

This paper proposes a diagnostic framework decomposing historical language difficulty into tokenization cost, predictive uncertainty, semantic robustness, and context sensitivity. The authors evaluate this framework on 17th-century Italian, 19th-century Italian, and 18th-century Russian texts to understand how LLMs process historical languages.