All articles — korshunov.ai

AI research documentation improves over decade

Analysis of 56,800 AI conference papers shows documentation practices improved from 2014 to 2024. Papers sharing both code and data increased from 11% to 64%, and estimated reproducibility rose from 28% to 64%. These improvements predate formal reproducibility checklists, indicating a broader shift toward open science.

arxiv arXiv cs.AI · 15d ago

Agentic LLM Framework for HTS Code Classification

A consensus-based agentic large language model framework is proposed for accurate 10-digit Harmonized Tariff Schedule code classification in Canadian maritime logistics. Evaluated on 3,300 expert-labeled product records, the framework shows that fine-grained HTS classification remains challenging for advanced LLMs, highlighting the need for evidence-grounded, uncertainty-aware, and human-in-the-loop workflows.

arxiv arXiv cs.AI · 15d ago

AI-Enabled Progress in Stable Menus of Public Goods

Experiments on EC 2025's 'Stable Menus of Public Goods' show that prompts encoding human intuition improve LLM performance and that multi-turn interaction helps with more ambitious steps. However, the LLM proves slightly less effective than a first-year PhD student working from an unpublished manuscript.

arxiv arXiv cs.AI · 15d ago

PACT: Small Language Model Deliberation for Reactive Reinforcement Learning

PACT combines a reactive RL policy with a 2B-parameter Small Language Model to generate and validate action plans. The SLM plan is executed directly if verified as safe, feasible, and complete, bypassing the RL policy. PACT outperforms baselines on three increasingly difficult FrozenLake environments.

arxiv arXiv cs.AI · 15d ago

ActiveSAM: Fast and Accurate Open-Vocabulary Segmentation

ActiveSAM is a training-free, zero-shot framework that enhances SAM 3 for open-vocabulary semantic segmentation by identifying an image-conditioned active class set. It improves the speed-accuracy tradeoff, outperforming SegEarth-OV3 by +1.4 mIoU on average and running up to 5.5x faster on large-vocabulary datasets, with strong robustness under image corruption.

arxiv arXiv cs.AI · 15d ago

Bayesian Audits Reveal Inconsistent AI Evaluation Timelines

Public AI evaluation archives show that a single terminal result can arise from two distinct pre-terminal histories, with estimated times to reach 95% of performance ceilings at 23.03 or 75.13. A candidate selection-aware frontier model fails synthetic recovery and uncertainty calibration, and is rejected by fixed audit gates. An archive-and-adjudication protocol verifies timing boundaries and falsifies unsupported frontier claims.

arxiv arXiv cs.AI · 15d ago

TuneJury: Open Metric for Music Generation Preference Alignment

TuneJury is an open, instance-level pairwise reward model that predicts music preference scores from text prompts and audio clips. It is trained on diverse human-preference data and demonstrates strong generalization, with anchor calibration enabling efficient post-hoc alignment for music generation systems.

arxiv arXiv cs.AI · 15d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprint without introducing prefix mismatches.

arxiv arXiv cs.AI · 15d ago

FusionRS: First Large-Scale RGB-Infrared Remote Sensing Dataset

FusionRS introduces the first large-scale RGB-infrared-text dataset for remote sensing vision-language modeling. It aligns RGB and infrared images with IR-aware captions, enabling dual-modal vision-language foundation models. Experiments show improved RGB-IR alignment, retrieval, and captioning, with ablation studies confirming the critical role of modality-specific textual supervision.

arxiv arXiv cs.AI · 15d ago

HAMON: Passive Optical Forecasting for Long-Horizon Time-Series

HAMON uses passive optical components to perform long-horizon time-series forecasting, outperforming top digital models on ETTm2 across all horizons and on ETTh2 at all but the longest horizon. It achieves up to 14% lower MSE and relies on physical optical propagation without trainable digital layers, demonstrating that passive optical mixing can produce competitive forecasts.

arxiv arXiv cs.AI · 15d ago

Phase in Neural Representations: An Internal Oppenheim-Lim Test

Experiments on image classifiers such as PRISM2D, GFNet, and ViT-B/16 show that phase, not magnitude, drives predictions in hidden layers. ResNet-50 reveals a latent sign code in late blocks, indicating that phase/sign identity exists across architectures, though it is expressed differently because of activation and readout mechanisms.

media Latent Space · 15d ago

Satya Nadella on Loopcraft and Frontier Ecosystems

Microsoft CEO Satya Nadella introduces 'Loopcraft' as a new theory of the firm, emphasizing that the real opportunity in AI lies not in selecting the best model, but in building learning loops that compound human and token capital. He asserts that the priority must be creating frontier ecosystems where every organization can own and grow its institutional knowledge, enabling broad value flow across industries and countries.

media r/LocalLLaMA · 15d ago

Qwable-v1 Released as Distillation of Claude Fable-5

Qwable-v1, an open-weight model distilled from Anthropic's Fable-5, is now publicly available on Hugging Face. It captures 4,659 cleartext agentic-coding traces from Fable-5's public corpus and emits properly formatted <tool_use> XML calls to Claude-flavored tools, reflecting the original tool surface in its weights.

media r/LocalLLaMA · 15d ago

vLLM releases new streaming parser for Qwen3+ in nightly

vLLM has introduced a new streaming parser for Qwen3+ available in its nightly build, addressing issues like mid-turn stopping and failed streaming tool calls due to chunk boundaries. The update reportedly resolves these problems in limited testing, improving reliability for agentic workflows.

media r/LocalLLaMA · 15d ago

HalBench Tests 29 Open Source Models on Sycophancy and Hallucination

HalBench evaluates 29 open-source LLMs on a custom benchmark for sycophancy and hallucination. Qwen 3.6 and Gemma 4 outperform larger models, with Qwen 3.6 achieving 36.6% pushback, higher than GPT-5.4 and Gemini 3.1 Pro. Model size does not correlate with honest responses, indicating that architecture and training data matter more than parameter count.

blog Simon Willison · 15d ago

Cloudflare CAPTCHA triggered only for searches with ampersand

Simon Willison configured Cloudflare's CAPTCHA to activate only for search queries containing at least one ampersand. The rule uses a custom filter: (http.request.uri.path wildcard r"/search/*" and http.request.uri.query contains "&"). This allows simple searches like /search/?q=lemur to pass without CAPTCHA.
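
To make the rule's logic concrete, here is a rough Python approximation of what the filter expression above checks. It is only a sketch: the function name captcha_applies is hypothetical, and fnmatch stands in for Cloudflare's wildcard operator, which it mimics closely enough for this single pattern but not in general.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def captcha_applies(url: str) -> bool:
    """Approximate the rule: path matches /search/* AND query contains '&'."""
    parts = urlsplit(url)
    # Assumption: lowercasing the path approximates case-insensitive matching.
    path_matches = fnmatchcase(parts.path.lower(), "/search/*")
    # The raw query string is checked, so an encoded %26 does not count.
    return path_matches and "&" in parts.query


print(captcha_applies("/search/?q=lemur"))         # False
print(captcha_applies("/search/?q=lemur&page=2"))  # True
print(captcha_applies("/about/?a=1&b=2"))          # False
```

Both conditions must hold, so a single-parameter search never triggers the challenge, while any search URL carrying a second parameter does.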

media r/LocalLLaMA · 15d ago

Gemma3 270M Model Released on Reddit

A user posted an image of the Gemma3 270M model on the r/LocalLLaMA subreddit. The post includes a link to the image and comments section, indicating community discussion around the model.

blog Simon Willison · 15d ago

datasette-agent 0.3a0 releases with user approval for write SQL operations

datasette-agent 0.3a0 introduces the execute_write_sql tool that prompts users before writing to databases, ensuring permission checks are respected. The update also enhances datasette agent chat with user approval support, new command options like --unsafe for auto-approval, and plain text tool outputs for CLI display.
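
The approval-before-write pattern described above can be sketched generically with Python's sqlite3 module. This is not datasette-agent's code or API: run_write_with_approval, the approve callback, and the unsafe parameter are hypothetical names chosen to mirror the behaviour in the summary.

```python
import sqlite3
from typing import Callable


def run_write_with_approval(
    conn: sqlite3.Connection,
    sql: str,
    approve: Callable[[str], bool],
    unsafe: bool = False,
) -> bool:
    """Run a write statement only after approval; return True if it ran."""
    # Hypothetical gate: unsafe=True skips the prompt, like an auto-approve flag.
    if not (unsafe or approve(sql)):
        return False
    with conn:  # commits on success, rolls back on error
        conn.execute(sql)
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

# A rejecting callback stands in for a user answering "no" at the prompt.
ran = run_write_with_approval(
    conn, "INSERT INTO notes VALUES ('hi')", approve=lambda sql: False
)
print(ran)  # False
```

In an interactive tool the approve callback would show the SQL to the user and wait for confirmation; keeping it as a parameter makes the gate easy to test.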