Evaluation & benchmarks — korshunov.ai

Factorized Neural Operators Decompose Dynamic and Persistent Responses

Factorized Neural Operators (FaNO) decompose spectral representations into equivariant dynamic responses and invariant persistent responses. This factorized structure improves interpretability and generalization, and yields consistent predictions across scales, domains, and physical regimes.

arXiv cs.LG · 9d ago

CEAP Reduces Variance in LLM Circuit Discovery

CEAP, a new circuit discovery method, substantially reduces resampling variance compared to EAP-IG. The paper shows that rephrasing variance arises from prompt templates activating different circuits, suggesting LLMs are inherently hard to steer across diverse inputs. Sample-wise variance is largely benign, as poor unfaithfulness scores result from selective contribution scaling, not circuit defects.

arXiv cs.LG · 9d ago

Adaptive Functional Gradient Descent with Convergence Guarantees

The paper proposes a functional gradient descent (FGD) algorithm that adapts its representation during optimization. The method converges to a stationary point under smooth losses, and to a global minimizer under smoothness plus a Polyak-Łojasiewicz condition, despite using finite-dimensional approximations. It outperforms both fixed-approximation FGD and neural network baselines in regression, PDE solving, and computer vision tasks.

arXiv cs.LG · 9d ago

Unified Causal-Origin Taxonomy of Distributional Shifts in RL

This paper proposes a unified causal-origin taxonomy for distributional shifts in reinforcement learning, linking ID/OOD generalization to non-stationary settings. It decomposes the agent-environment interaction using a POMDP framework, identifying internal (agent-driven) and external (environment-driven) shifts, with explicit, implicit, and hybrid types defined by the shifted-time boundary. The work also introduces an evaluation framework that measures shift impact through performance degradation and recovery metrics, enabling systematic analysis of RL robustness.

arXiv cs.LG · 9d ago

CircuitLasso: Scalable Circuit Learning for LLM Interpretability

CircuitLasso enables scalable circuit learning in large language models by using sparse linear regression. It recovers circuits with structural accuracy matching state-of-the-art methods at significantly lower computational cost, and demonstrates human-interpretable semantic propagation through model components. The learned circuits achieve comparable performance on a domain-generalization task with reduced cost.

arXiv cs.LG · 9d ago

A nonparametric two-sample test using PReLU-IPM

The study introduces PReLU-IPM, a new integral probability metric based on a neural network discriminator with a single node. The resulting PReLU-TST test is nonparametric, consistent, and asymptotically equivalent to standard IPM-based tests, showing higher power or competitive performance on simulated and real datasets.

arXiv cs.LG · 9d ago

Causal Framework for Auditing Synthetic Data Disclosures

A model-agnostic auditing framework detects and distinguishes true and phantom disclosures in synthetic data. It uses only synthetic outputs and a held-out control set to perform statistical testing, offering tighter privacy leakage bounds than prior methods without requiring model access or additional training.

arXiv cs.LG · 9d ago

Hybrid Convolutional VAE for Crypto Volatility Surfaces

A convolutional variational autoencoder trained on 6,034 Binance Options surfaces for BTC and ETH achieves 0.94-1.56 vol-point RMSE under 10-50% masking. The hybrid predictor reduces error from 7.00 to 0.83 vol points at 50% masking, outperforming parametric re-fit in structured hole patterns and detecting abnormal market events without supervision.

arXiv cs.LG · 9d ago

Task-Error Residual Learning for Real-Robot Five-Ball Juggling

A residual learning approach using directional task-error supervision achieves stable five-ball juggling on real robots, converging from the second attempt. The system outperforms human practice timelines and relies on both directional feedback and an informative prior, with a fixed-Jacobian Newton update proving most reliable.

arXiv cs.LG · 9d ago

Probabilistic Thinning Decouples Inference from State Updates

A new method decouples ML inference from state persistence in streaming systems using probabilistic thinning. It selectively triggers durable state updates based on event informativeness, reducing persistence path overhead by up to 90% without compromising downstream utility or introducing systemic errors.

arXiv cs.LG · 9d ago
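
The thinning idea in the item above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's method: the function name `thin_stream`, the `surprise` field used as an informativeness score, the `floor` probability, and the inverse-probability weight are all hypothetical.

```python
import random


def thin_stream(events, score, rng, floor=0.05):
    """Yield (event, weight) pairs for events chosen for a durable write.

    Hypothetical rule: keep each event with probability
    p = max(floor, score(event)), capped at 1. The weight 1/p allows
    inverse-probability reweighting of whatever is persisted.
    """
    for event in events:
        p = min(1.0, max(floor, score(event)))
        if rng.random() < p:
            yield event, 1.0 / p


# Toy stream: the "surprise" field stands in for an informativeness score.
rng = random.Random(0)
events = [{"id": i, "surprise": 0.9 if i % 10 == 0 else 0.02} for i in range(1000)]

# Inference would still run on every event; only the durable write is thinned.
persisted = list(thin_stream(events, lambda e: e["surprise"], rng))
print(f"persisted {len(persisted)} of {len(events)} events")
```

In this sketch the model would still score every event, while the expensive persistence path sees only the sampled subset. How much is saved depends entirely on the score distribution and the floor chosen.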

Dynestyx: Probabilistic Programming for Dynamical Systems

Dynestyx is a probabilistic programming library that provides first-class support for state-space models. It enables users to specify arbitrary priors for discrete- or continuous-time dynamical systems, perform inference on mixed-effect data, and obtain state and parameter estimates with principled uncertainty quantification.

arXiv cs.LG · 9d ago

Analytic Torsion and Spectral Gap Capture Persistent-Laplacian Performance

A compact spectral representation using Betti numbers, spectral gap, and analytic torsion distills persistent Laplacians into three mathematically grounded invariants. This approach captures essential predictive signals from the full spectrum, outperforms it in some cases, and reduces computational overhead on datasets like MNIST, QM-3D, and SKEMPI WT.

arXiv cs.LG · 9d ago
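
To make the three invariants in the item above concrete, here is a rough sketch for a single Laplacian matrix. It assumes an ordinary graph Laplacian rather than a persistent one, and it uses the log of the product of nonzero eigenvalues as a stand-in for the torsion term, since analytic torsion is built from such determinants across dimensions. The function name `laplacian_summary` and the tolerance are hypothetical, and the paper's exact definitions may differ.

```python
import numpy as np


def laplacian_summary(laplacian, tol=1e-9):
    """Return (betti, spectral_gap, log_pseudo_det) for a symmetric PSD Laplacian."""
    eigvals = np.linalg.eigvalsh(laplacian)      # sorted ascending
    nonzero = eigvals[eigvals > tol]
    betti = int(np.sum(eigvals <= tol))          # dimension of the kernel
    gap = float(nonzero[0]) if nonzero.size else 0.0  # smallest nonzero eigenvalue
    log_pseudo_det = float(np.sum(np.log(nonzero)))   # log of product of nonzero eigenvalues
    return betti, gap, log_pseudo_det


# Path graph 0 - 1 - 2: one connected component.
path = np.array([[1.0, -1.0, 0.0],
                 [-1.0, 2.0, -1.0],
                 [0.0, -1.0, 1.0]])

# One edge 0 - 1 plus an isolated vertex 2: two components.
split = np.array([[1.0, -1.0, 0.0],
                  [-1.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0]])

print(laplacian_summary(path))
print(laplacian_summary(split))
```

For the degree-0 Laplacian used here, the kernel dimension counts connected components. Higher-dimensional Laplacians would be handled the same way, one matrix at a time.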

Multi-Center Benchmark for Abdominal Disease Diagnosis from Non-Contrast CT

A new multi-center benchmark enables abdominal disease diagnosis and report generation from non-contrast CT by synthesizing contrast-enhanced findings. The dataset includes paired NCCT-CECT studies and reports from two centers, showing NCCT achieves average multi-organ AUCs of 69.1% internally and 63.1% externally. The benchmark and code are publicly released to support research into safer, contrast-free abdominal imaging workflows.

arXiv cs.LG · 9d ago

ActiveSAM: Fast and Accurate Open-Vocabulary Segmentation

ActiveSAM is a training-free, zero-shot framework that enhances SAM 3 for open-vocabulary semantic segmentation by identifying an image-conditioned active class set. It improves speed-accuracy tradeoff, outperforming SegEarth-OV3 by +1.4 mIoU on average and running up to 5.5x faster on large-vocabulary datasets, with strong robustness to image corruption.

arXiv cs.LG · 9d ago

Post-Hoc Falsification Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that 26 semantic post-hoc operators do not improve held-out accuracy over Best-of-N in frozen small code models. While some operators reduce compute usage or recover correct programs, none outperform BoN in accuracy, due to systemic limitations like coverage walls and consensus traps. An expression-layer recovery (M1) improves performance on HumanEval+ by 12 tasks, with no harm or leakage, and shows consistent results across model cells.

arXiv cs.LG · 9d ago

PPAD-hardness for min-max optimization of quadratic polynomials

Computing approximate stationary points of min-max optimization over the hypercube is PPAD-hard for quadratic polynomials. The result holds even for multilinear polynomials in which each variable appears in at most three monomials, and for inverse-polynomial approximation factors. As a consequence, two-team zero-sum polymatrix games are also PPAD-hard.

arXiv cs.LG · 9d ago

TuneJury: Open Metric for Music Generation Preference Alignment

TuneJury is an open, instance-level pairwise reward model that predicts music preference scores from text prompts and audio clips. It is trained on diverse human-preference data and demonstrates strong generalization, with anchor calibration enabling efficient post-hoc alignment for music generation systems.

arXiv cs.LG · 9d ago

Neural EXposure Interaction Search for Interpretable HTE

NEXIS identifies causal heterogeneous treatment effects by discovering Markov blankets in pre-treatment data. It leverages multi-modal, multi-view measurements and scalable representations with minimal human input, enabling interpretable and actionable policy insights from controlled experiments.

arXiv cs.LG · 9d ago

Filtered Conformal Ellipsoids for Graph-Native Time Series

A new method called filtered conformal ellipsoids provides prediction sets for multivariate time series by using a frozen state-space filter to generate predictive means and covariances, then applying split-conformal calibration to Mahalanobis scores. The approach achieves coverage under dependence through contraction in an observable predictive-law quotient, with theoretical bounds derived under Gaussian-projection and observability conditions, and shows sharper ellipsoids on graph-native traffic benchmarks compared to static and non-filter baselines.

arXiv cs.LG · 9d ago
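
The calibration step described in the item above, split-conformal calibration of Mahalanobis scores, can be sketched as follows. This is a minimal illustration under assumptions: the predictive means and covariance are simulated rather than produced by a state-space filter, the helper names are hypothetical, and the standard split-conformal quantile is used, which assumes exchangeability rather than the dependence setting the paper analyzes.

```python
import numpy as np


def mahalanobis_sq(y, mean, cov):
    """Squared Mahalanobis distance of y from mean under covariance cov."""
    diff = y - mean
    return float(diff @ np.linalg.solve(cov, diff))


def conformal_threshold(scores, alpha):
    """Standard split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(np.sort(scores)[k - 1])


def in_ellipsoid(y, mean, cov, threshold):
    """True if y lies in the ellipsoid {y : score(y) <= threshold}."""
    return mahalanobis_sq(y, mean, cov) <= threshold


# Simulated calibration set; in the paper's setting the means and
# covariances would come from the frozen filter instead.
rng = np.random.default_rng(0)
dim, n_cal = 2, 200
cov = np.array([[1.0, 0.3], [0.3, 0.5]])
means = rng.normal(size=(n_cal, dim))
targets = means + rng.multivariate_normal(np.zeros(dim), cov, size=n_cal)

scores = np.array([mahalanobis_sq(y, m, cov) for y, m in zip(targets, means)])
threshold = conformal_threshold(scores, alpha=0.1)
print(f"calibrated squared-radius threshold: {threshold:.3f}")
```

A new prediction set is then the ellipsoid centered at the filter's predictive mean, shaped by its predictive covariance, and scaled by the calibrated threshold.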

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to stabilize prompt prefixes and manage context segments efficiently.