All articles — korshunov.ai

RoboMME-Interference: Benchmarking Robot Memory Under Interference

RoboMME-Interference introduces a cross-session benchmark to evaluate robot memory under interference. It adds unrelated sessions to prior demonstrations, revealing that perceptual memory variants degrade significantly as distractions increase, highlighting current systems' lack of robustness to interference and the need for long-context memory.

github llama.cpp · 11h ago

llama.cpp releases b9782 with new binaries and support

llama.cpp releases version b9782, including binaries for macOS, Linux, Android, Windows, and openEuler. The release adds support for Vulkan, OpenVINO, SYCL, ROCm, and CUDA across multiple architectures, with updated UI and disabled features such as KleidiAI and openEuler support.

lab Google DeepMind Blog · 11h ago

Gemini 3.5 Flash Adds Computer Use Capability

Google has introduced computer use in Gemini 3.5 Flash, enabling the model to execute code and interact with external tools. This feature allows users to run programming tasks and access real-time information through integrated computing functions.

arxiv arXiv cs.AI · 11h ago

Flow Annealing Posterior Sampling for Function-Space Regression and Inverse Problems

FAPS is the first function-space posterior sampling framework that unifies stochastic-process regression and PDE inverse problems. It uses pretrained flow-matching priors and Langevin correction with low-rank covariance preconditioning to enable efficient, accurate posterior inference from sparse, noisy data with coherent uncertainty quantification.

media r/LocalLLaMA · 11h ago

Has anyone else found vLLM outputs worse than llama.cpp?

A user reports noticing less reliable outputs from vLLM compared to llama.cpp, including formatting errors, context forgetting, and lower code quality. They ask whether such differences stem from quantization, chat templates, parser issues, or configuration errors, and seek confirmation if others have observed similar quality discrepancies between inference backends.

media r/LocalLLaMA · 11h ago

Sipp: Open-source library for in-browser inference built on llama.cpp

Sipp is an open-source library that enables in-browser inference using llama.cpp. It allows users to run local language model inference directly in web browsers without relying on cloud services. The project is available on GitHub at https://github.com/noumena-labs/Sipp.

arxiv arXiv cs.AI · 11h ago

Select-to-Act: Hierarchical RL with Adaptive Language Guidance

HRLLI introduces a hierarchical reinforcement learning framework that adapts natural-language instructions dynamically during decision-making. It decomposes instructions into stage-specific guidance elements and uses a select-to-act paradigm to enable real-time selection of relevant instruction pieces, improving sample efficiency and performance in complex environments.

arxiv arXiv cs.AI · 11h ago

SAFER: Reliable Test-Time Adaptation under Adversarial Streams

SAFER is a training-free framework that enhances robustness of test-time adaptation by using reliability-guided augmentation. It generates stochastic augmentations, pools predictions via correlation-weighted aggregation with outlier detection, and includes adaptive mixing to preserve clean performance under adversarial attacks. Evaluations on PACS, VLCS, and OfficeHome show improved resilience without sacrificing clean accuracy.

arxiv arXiv cs.AI · 11h ago

Sparsity-Storage-Accuracy Tradeoff in Parsimoniously Activated Dictionary Learning

Parsimoniously activated dictionary learning (PADL) establishes a structured generative model with auxiliary latent variables, enabling maximum a posteriori estimation. This framework provides generalization guarantees and an analytical characterization of the tradeoff between sparsity, storage cost, and reconstruction accuracy, allowing data-driven hyperparameter estimation. The resulting algorithm achieves better reconstruction performance and accelerates inference in vision-language models.

arxiv arXiv cs.AI · 11h ago

First-Token Broadcasters in Transformers: Language Identity and Robustness

LIHA reveals a small set of first-token broadcaster heads in GPT-2 that persistently attend to the initial prompt token, driving language switches. Instruction tuning reorganizes these circuits, concentrating language identity at early layers, as seen in Qwen2.5-1.5B-Instruct and confirmed in Chinese and Russian language handling at layer 0.

arxiv arXiv cs.AI · 11h ago

Reference-Free Assessment of Physical Consistency in Video Generation

A new method evaluates physical consistency in generated videos without requiring human voting or ground-truth references. It uses DROID-SLAM and SEA-RAFT to detect inconsistencies, improving task success rates by over 8% and enabling spatio-temporal localization of physical artifacts.

arxiv arXiv cs.AI · 11h ago

LLM-Assisted Label Cleaning in Chest CT Dataset

A large language model (LLM) assisted in identifying label-report discordance in the CT-RATE chest CT dataset. GPT-5.4 achieved 96.4% agreement with existing labels, with radiologist adjudication supporting LLM-derived labels in 74.2% of general and 91.9% of lymphadenopathy discordances. Multi-LLM majority-vote labels outperformed others in F1 score and kappa, and the cleaned dataset will be publicly released.
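The majority-vote step mentioned above can be illustrated with a minimal sketch. The function names, the tie-handling rule, and the data layout are assumptions made for illustration; the summary does not describe how the study actually aggregated votes.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label, or None if there is no clear winner."""
    counts = Counter(labels).most_common()
    if not counts:
        return None  # no votes at all
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: assumed here to be deferred to a human reviewer
    return counts[0][0]


def find_discordances(existing, llm_votes):
    """List (case_id, existing_label, voted_label) where the vote disagrees.

    `existing` maps case id -> current dataset label (hypothetical layout).
    `llm_votes` maps case id -> list of labels, one per LLM.
    """
    discordant = []
    for case_id, old_label in existing.items():
        voted = majority_vote(llm_votes.get(case_id, []))
        if voted is not None and voted != old_label:
            discordant.append((case_id, old_label, voted))
    return discordant
```

In this sketch, cases flagged by `find_discordances` would be the candidates for radiologist adjudication; ties are left untouched rather than relabeled.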

arxiv arXiv cs.AI · 11h ago

ARIA: A Causal-Aware Framework for Rescuing LLM Reasoning

ARIA addresses contextual tunneling in LLMs by conditioning knowledge use on mechanistic completeness. It uses a three-tier cascade for causal reasoning, physics-informed transfer, and parametric fallback, and improves materials discovery through auditable, physically grounded reasoning.

arxiv arXiv cs.AI · 11h ago

HyperAdapter: Structured Hyperedge Adaptation for Vision Transformer Fine-Tuning

HyperAdapter introduces a hypergraph-based adapter that performs structured, group-aware adaptation in vision transformers by operating in hyperedge space rather than token space. It uses prototype-based assignments to build a soft hypergraph, aggregates token features into hyperedge representations, applies lightweight adaptation, and diffuses updates back via hypergraph structure, enabling explicit structural inductive bias while maintaining efficiency. Experiments show consistent performance gains over baseline PEFT methods, especially on tasks requiring structured reasoning.

arxiv arXiv cs.AI · 11h ago

MetaPS: Adaptive Strategy Selection for Market Agents

MetaPS is a simulation-guided framework that enables market agents to adaptively select among programmatic strategies based on market states. It uses simulated markets to generate supervised training data, then selects strategies during inference to produce executable actions. Experiments show MetaPS outperforms fixed strategies and LLM-based agents, with compact models exceeding stronger API models in performance.

arxiv arXiv cs.AI · 11h ago

PlanBench-XL: Benchmark for Long-Horizon Tool-Use Planning

PlanBench-XL evaluates long-horizon planning in LLM agents across 1,665 tools through 327 retail tasks. It introduces a blocking mechanism to simulate real-world tool failures, revealing that agents like GPT-5.4 drop from 51.90% to 11.36% accuracy under severe disruptions, highlighting vulnerabilities in recovery and error handling.

arxiv arXiv cs.AI · 11h ago

P4IR Framework Improves LLM-Based Code Compliance Accuracy

P4IR, a two-stage framework, uses supervised fine-tuning and Group Relative Policy Optimization to enhance large language model-based automated code compliance systems. It reduces tree edit and token-level Levenshtein distances by up to 23.8% and 38.6% respectively, outperforming leading LLMs such as Claude Opus, GPT-5.2, and GLM-4.7 under zero-shot and few-shot prompting, and reduces false positives by a small but statistically significant margin.
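For readers unfamiliar with the metric, token-level Levenshtein distance counts the minimum number of token insertions, deletions, and substitutions needed to turn one sequence into another. The sketch below is a standard dynamic-programming version; the whitespace tokenization in the usage note is an assumption, not the tokenizer used in the paper.

```python
def token_levenshtein(a, b):
    """Edit distance between two token sequences (unit cost per edit)."""
    # prev[j] holds the distance between the first i-1 tokens of a
    # and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty sequence
        for j, tok_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if tok_a == tok_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

For example, `token_levenshtein("a b c".split(), "a x c".split())` compares two three-token sequences that differ in one position, so a single substitution suffices.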

arxiv arXiv cs.AI · 12h ago

Gold Points Sniper: Self-guided Visual Reasoning for Fine-grained Action Understanding

Gold Points Sniper (GPS) enables lightweight vision-language models to perform self-guided multimodal reasoning for fine-grained human action understanding. By integrating a Gold Points Extractor, Selective Socratic Questioner, and Semantic Entailment Evaluator, GPS achieves performance comparable to GPT-4o while maintaining superior factual accuracy on CAP benchmark-based instruction-tuning data.

arxiv arXiv cs.AI · 12h ago

Structural Codebase Index Improves Resolve Without Cost Penalty

A structural codebase index for coding agents improves both localization and resolve rate without increasing cost per cell. It outperforms agentic-grep baselines on both metrics and achieves a lower cost per solved task, especially in workloads with multi-file changes.

arxiv arXiv cs.AI · 12h ago

SciVerseGym: Reinforcement Learning Environment for Crystal Discovery

SciVerseGym introduces a Gymnasium-compatible environment that frames crystal discovery as a Markov decision process. It enables agents to perform chemically meaningful edits on atomic structures and receive feedback from configurable evaluators, supporting diverse actions and observation types with machine-learned potentials or ASE-compatible calculators.
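To make the Markov-decision-process framing concrete, here is a toy sketch that mirrors the Gymnasium `reset`/`step` contract (`reset` returns an observation and an info dict; `step` returns observation, reward, terminated, truncated, info). Everything else is hypothetical: the class `ToyCrystalEnv`, the element list, the substitution action, and the `toy_score` evaluator are illustrative stand-ins, not SciVerseGym's actual API.

```python
import random


def toy_score(sites):
    """Hypothetical evaluator: counts Cl sites. A real one might call a
    machine-learned potential or an ASE-compatible calculator."""
    return sum(1 for element in sites if element == "Cl")


class ToyCrystalEnv:
    """Toy environment following the Gymnasium reset/step return shapes."""

    ELEMENTS = ("Na", "K", "Cl", "Br")

    def __init__(self, evaluator, n_sites=4, max_steps=10):
        self.evaluator = evaluator  # configurable scoring function
        self.n_sites = n_sites
        self.max_steps = max_steps
        self._sites = []
        self._steps = 0

    def reset(self, seed=None):
        rng = random.Random(seed)
        self._sites = [rng.choice(self.ELEMENTS) for _ in range(self.n_sites)]
        self._steps = 0
        return tuple(self._sites), {}

    def step(self, action):
        # Action is assumed to be (site index, new element): a substitution edit.
        site, element = action
        before = self.evaluator(self._sites)
        self._sites[site] = element
        after = self.evaluator(self._sites)
        self._steps += 1
        reward = after - before  # reward is the change in evaluator score
        terminated = False  # this toy task has no terminal state
        truncated = self._steps >= self.max_steps
        return tuple(self._sites), reward, terminated, truncated, {"score": after}
```

Passing the evaluator into the constructor is one simple way to model the "configurable evaluators" the summary mentions; swapping `toy_score` for another callable changes the reward signal without touching the environment.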