Evaluation & benchmarks — korshunov.ai

Causal Framework for Auditing Synthetic Data Disclosures

A new empirical auditing framework detects synthetic data disclosures and classifies each as either true or phantom. Using only synthetic output and a held-out control set, with no model access or additional training, it distinguishes direct reproductions of user data from incidental generation. The method provides tighter privacy leakage bounds than prior approaches and requires significantly fewer computational resources.

arXiv cs.AI · 9d ago

Low Frame Rate Degradation in Neural Audio Codecs

A quality cliff at 6.25 Hz in neural audio codecs is caused by insufficient training token exposure due to fixed clip duration. Correcting this training configuration enables smooth WER degradation down to 3.1 Hz and 1.6 Hz, indicating low frame rate efficiency is more achievable than previously thought.

arXiv cs.AI · 9d ago
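
The codec item above reports quality as WER (word error rate). As a rough illustration of how that metric is commonly computed, here is a minimal sketch; the function name and the whitespace tokenization are assumptions made for this example and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.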

Textual Reviews Have Limited Impact in Recommendation Models

A study finds that while textual review signals can be fused with collaborative data, their marginal contribution remains limited compared to collaborative signals in matrix factorization models. Adaptive fusion and cross-attention mechanisms improve representation flexibility, but do not significantly boost performance across datasets.

arXiv cs.AI · 9d ago

AI Research Documentation Improves Over Decade

Analysis of 56,800 AI conference papers shows documentation practices improved from 2014 to 2024. Papers sharing both code and data increased from 11% to 64%, and estimated reproducibility rose from 28% to 64%. These improvements predate formal reproducibility checklists, indicating a broader shift toward open science.

arXiv cs.AI · 9d ago

Agentic LLM Framework for HTS Code Classification

A consensus-based agentic large language model framework is proposed for accurate 10-digit Harmonized Tariff Schedule code classification in Canadian maritime logistics. Evaluated on 3,300 expert-labeled product records, the framework shows that fine-grained HTS classification remains challenging for advanced LLMs, highlighting the need for evidence-grounded, uncertainty-aware, and human-in-the-loop workflows.

arXiv cs.AI · 9d ago

ActiveSAM: Fast and Accurate Open-Vocabulary Segmentation

ActiveSAM is a training-free, zero-shot framework that enhances SAM 3 for open-vocabulary semantic segmentation by identifying an image-conditioned active class set. It improves speed-accuracy tradeoff, outperforming SegEarth-OV3 by +1.4 mIoU on average and running up to 5.5x faster on large-vocabulary datasets, with strong robustness under image corruption.

arXiv cs.AI · 9d ago
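
The segmentation item above compares methods by mIoU (mean intersection over union). A minimal sketch of the usual per-class computation follows, assuming integer label maps; the function name, the NumPy dependency, and the choice to skip classes absent from both maps are assumptions for this example, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def mean_iou(pred, target, num_classes: int) -> float:
    """Average the per-class IoU over classes present in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class appears in neither map, so it is skipped
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")
```

Benchmarks usually report this value multiplied by 100, so a "+1.4 mIoU" gain is most likely a difference of 1.4 points on that scale.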

Bayesian Audits Reveal Inconsistent AI Evaluation Timelines

Public AI evaluation archives show that a single terminal result can arise from two distinct pre-terminal histories, with estimated times to reach 95% of performance ceilings at 23.03 or 75.13. A candidate selection-aware frontier model fails synthetic recovery and uncertainty calibration, and is rejected by fixed audit gates. An archive-and-adjudication protocol verifies timing boundaries and falsifies unsupported frontier claims.

arXiv cs.AI · 9d ago

TuneJury: Open Metric for Music Generation Preference Alignment

TuneJury is an open, instance-level pairwise reward model that predicts music preference scores from text prompts and audio clips. It is trained on diverse human-preference data and demonstrates strong generalization, with anchor calibration enabling efficient post-hoc alignment for music generation systems.

arXiv cs.AI · 9d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to preserve prompt cache continuity and minimize token footprint without introducing prefix mismatches.

arXiv cs.AI · 9d ago

HAMON: Passive Optical Forecasting for Long-Horizon Time-Series

HAMON uses passive optical components to perform long-horizon time-series forecasting, outperforming top digital models on ETTm2 across all horizons and on ETTh2 at all but the longest horizon. It achieves up to 14% lower MSE and relies on physical optical propagation without trainable digital layers, demonstrating that passive optical mixing can produce competitive forecasts.

arXiv cs.AI · 9d ago

Phase in Neural Representations: An Internal Oppenheim-Lim Test

In image classifiers such as PRISM2D, GFNet, and ViT-B/16, phase rather than magnitude drives predictions in hidden layers. ResNet-50 reveals a latent sign code in late blocks, indicating that phase/sign identity exists across architectures, though it is expressed differently because of activation and readout mechanisms.

arXiv cs.LG · 9d ago

Factorized Neural Operators Decompose Dynamic and Persistent Responses

Factorized Neural Operators (FaNO) decompose spectral representations into equivariant dynamic and invariant persistent responses. This factorized structure enables better interpretability, generalization, and consistent predictions across scales, domains, and physical regimes.

arXiv cs.LG · 9d ago

CEAP Reduces Variance in LLM Circuit Discovery

CEAP, a new circuit discovery method, substantially reduces resampling variance compared to EAP-IG. The paper shows that rephrasing variance arises from prompt templates activating different circuits, suggesting LLMs are inherently hard to steer across diverse inputs. Sample-wise variance is largely benign, as poor unfaithfulness scores result from selective contribution scaling, not circuit defects.

arXiv cs.LG · 9d ago

Adaptive Functional Gradient Descent with Convergence Guarantees

A new functional gradient descent (FGD) algorithm adapts its representation during optimization. Despite using finite-dimensional approximations, the method converges to a stationary point under smooth losses and to a global minimizer when the loss is smooth and satisfies a Polyak-Łojasiewicz condition. It outperforms both fixed-approximation FGD and neural network baselines in regression, PDE solving, and computer vision tasks.

arXiv cs.LG · 9d ago
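
For context on the Polyak-Łojasiewicz condition named in the item above, the block below states its standard finite-dimensional form and the classical linear rate it gives for plain gradient descent. The symbols $f$, $f^\*$, $\mu$, $L$, and $x_k$ are notation chosen for this note; the paper works in a functional setting, so its precise assumptions and rates may differ.

```latex
% Polyak-Lojasiewicz (PL) condition with constant \mu > 0,
% where f^* = \inf_x f(x):
\frac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu \,\bigl(f(x) - f^{*}\bigr)
\qquad \text{for all } x .

% If f is also L-smooth, gradient descent with step size 1/L,
%   x_{k+1} = x_k - \tfrac{1}{L}\,\nabla f(x_k),
% satisfies the linear rate
f(x_k) - f^{*} \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{k}\,\bigl(f(x_0) - f^{*}\bigr).
```

The condition does not require convexity, which is why it can yield global convergence guarantees for some non-convex losses.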

Unified Causal-Origin Taxonomy of Distributional Shifts in RL

This paper proposes a unified causal-origin taxonomy for distributional shifts in reinforcement learning, linking ID/OOD generalization to non-stationary settings. It decomposes the agent-environment interaction using a POMDP framework and identifies internal (agent-driven) and external (environment-driven) shifts, with explicit, implicit, and hybrid types defined by the shifted-time boundary. The work also introduces an evaluation framework that measures shift impact through performance degradation and recovery metrics, enabling systematic analysis of RL robustness.

arXiv cs.LG · 9d ago

CircuitLasso: Scalable Circuit Learning for LLM Interpretability

CircuitLasso enables scalable circuit learning in large language models by using sparse linear regression. It recovers circuits with structural accuracy matching state-of-the-art methods at significantly lower computational cost, and demonstrates human-interpretable semantic propagation through model components. The learned circuits achieve comparable performance on a domain-generalization task with reduced cost.

arXiv cs.LG · 9d ago

A Nonparametric Two-Sample Test Using PReLU-IPM

The study introduces PReLU-IPM, a new integral probability metric based on a neural network discriminator with a single node. The resulting PReLU-TST test is nonparametric, consistent, and asymptotically equivalent to standard IPM-based tests, showing higher power or competitive performance on simulated and real datasets.

arXiv cs.LG · 9d ago
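
As background for the item above, the block below gives the general definition of an integral probability metric (IPM) and of the PReLU activation. The last line is only one plausible reading of "a discriminator with a single node"; the summary does not specify the function class, so the parameterization $(w, b, a)$ is an assumption.

```latex
% Integral probability metric over a function class \mathcal{F}:
d_{\mathcal{F}}(P, Q) \;=\; \sup_{f \in \mathcal{F}}
  \Bigl|\, \mathbb{E}_{X \sim P}\bigl[f(X)\bigr] - \mathbb{E}_{Y \sim Q}\bigl[f(Y)\bigr] \Bigr| .

% PReLU activation with slope parameter a on the negative side:
\mathrm{PReLU}_{a}(z) \;=\; \max(0, z) + a \,\min(0, z).

% Assumed single-node discriminator (illustrative, not from the source):
f_{w,b,a}(x) \;=\; \mathrm{PReLU}_{a}\bigl(w^{\top} x + b\bigr).
```

A two-sample test built on an IPM rejects the hypothesis $P = Q$ when an empirical estimate of $d_{\mathcal{F}}$ exceeds a threshold, which is often calibrated by permutation.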

Causal Framework for Auditing Synthetic Data Disclosures

A model-agnostic auditing framework detects and distinguishes true and phantom disclosures in synthetic data. It uses only synthetic outputs and a held-out control set to perform statistical testing, offering tighter privacy leakage bounds than prior methods without requiring model access or additional training.

arXiv cs.LG · 9d ago

Hybrid Convolutional VAE for Crypto Volatility Surfaces

A convolutional variational autoencoder trained on 6,034 Binance Options surfaces for BTC and ETH achieves 0.94-1.56 vol-point RMSE under 10-50% masking. The hybrid predictor reduces error from 7.00 to 0.83 vol points at 50% masking, outperforming parametric re-fit in structured hole patterns and detecting abnormal market events without supervision.

arXiv cs.LG · 9d ago

Task-Error Residual Learning for Real-Robot Five-Ball Juggling

A residual learning approach using directional task-error supervision achieves stable five-ball juggling on real robots, converging from the second attempt. The system outperforms human practice timelines and relies on both directional feedback and an informative prior, with a fixed-Jacobian Newton update proving most reliable.