Topic · Evaluation & benchmarks
arXiv cs.LG · 10d ago

HABC Improves RL Fine-Tuning of VLAs with Sparse Outcomes

Hierarchical Advantage-Weighted Behavior Cloning (HABC) improves online RL fine-tuning of vision-language-action (VLA) models when only sparse outcome signals are available. It uses separate critic heads for viability and efficiency, combines their outputs through a state-adaptive gate, and applies per-transition weights; intervention-aware credit assignment prevents supervision leakage. In real-robot experiments on three bimanual tasks, HABC reaches success rates of 92%, 88%, and 38%, against SFT baselines of 36%, 44%, and 12%, respectively.
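
The gating and weighting idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the paper's formulation: the convex-combination gate, the exponential (AWR-style) weight with a clip, and the names `gated_advantage`, `bc_weight`, and `weighted_bc_loss` are all hypothetical.

```python
import math


def gated_advantage(adv_viability, adv_efficiency, gate):
    """Blend two critic-head advantages with a gate value in [0, 1].

    Assumption: the gate is a convex mixing coefficient produced per state.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    return gate * adv_viability + (1.0 - gate) * adv_efficiency


def bc_weight(advantage, temperature=1.0, max_weight=20.0):
    """Assumed exponential advantage weight, clipped for numerical stability."""
    return min(math.exp(advantage / temperature), max_weight)


def weighted_bc_loss(neg_log_probs, weights):
    """Mean of per-transition negative log-likelihoods scaled by their weights."""
    total = sum(w * nll for w, nll in zip(weights, neg_log_probs))
    return total / len(neg_log_probs)


# Usage: one transition where the viability head dominates the gate.
adv = gated_advantage(adv_viability=0.8, adv_efficiency=-0.2, gate=0.9)
weight = bc_weight(adv)
```

Under these assumptions, a transition that both heads rate poorly gets a weight below 1 and contributes less to the cloning loss, while the clip keeps a single high-advantage transition from dominating a batch.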

arXiv cs.LG · 10d ago

Post-Hoc Falsification Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that none of 26 semantic post-hoc operators improves held-out accuracy over Best-of-N (BoN) in frozen small code models. Some operators reduce compute usage or recover correct programs, but none beats BoN on accuracy; the study attributes this to systemic limitations such as coverage walls and consensus traps. An expression-layer recovery (M1) improves HumanEval+ results by 12 tasks, with no harm or leakage, and gives consistent results across model cells.
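
For readers unfamiliar with the baseline, here is a minimal sketch of Best-of-N selection. It assumes candidates are ranked by how many visible tests they pass; the study's actual selection rule is not described in this summary, and the names `passes` and `best_of_n` are hypothetical.

```python
def passes(candidate, test):
    """Return True if the candidate maps the test inputs to the expected output."""
    args, expected = test
    try:
        return candidate(*args) == expected
    except Exception:
        # A crashing candidate simply fails the test.
        return False


def best_of_n(candidates, visible_tests):
    """Pick the candidate that passes the most visible tests (first one on ties)."""
    def score(candidate):
        return sum(passes(candidate, t) for t in visible_tests)
    return max(candidates, key=score)


# Usage: three sampled "programs" for doubling a number, one of which crashes.
candidates = [lambda x: x + 1, lambda x: x * 2, lambda x: 1 // 0]
visible_tests = [((2,), 4), ((3,), 6)]
chosen = best_of_n(candidates, visible_tests)
```

One plausible reading of a "coverage wall" is visible in this setup: if no sampled candidate is correct, no selection rule applied afterwards can return a correct program.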

arXiv cs.LG · 10d ago

Multi-Center Benchmark for Abdominal Disease Diagnosis from Non-Contrast CT

A new multi-center benchmark supports abdominal disease diagnosis and report generation from non-contrast CT (NCCT) by synthesizing contrast-enhanced CT (CECT) findings. The dataset contains paired NCCT-CECT studies and reports from two centers; on it, NCCT achieves average multi-organ AUCs of 69.1% in internal and 63.1% in external evaluation. The benchmark and code are publicly released to support research into safer, contrast-free abdominal imaging workflows.

arXiv cs.LG · 10d ago

Filtered Conformal Ellipsoids for Graph-Native Time Series

Filtered conformal ellipsoids provide prediction sets for multivariate time series. A frozen state-space filter generates predictive means and covariances, and split-conformal calibration is then applied to the resulting Mahalanobis scores. Coverage under dependence follows from contraction in an observable predictive-law quotient, with theoretical bounds derived under Gaussian-projection and observability conditions. On graph-native traffic benchmarks, the method yields sharper ellipsoids than static and non-filter baselines.
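
The calibration step can be sketched as standard split-conformal prediction on Mahalanobis scores. This is a minimal illustration under stated assumptions, not the paper's implementation: it is restricted to two dimensions, uses the squared distance as the score, and takes the filter's predictive mean and covariance as given inputs. The names `mahalanobis_sq`, `conformal_radius`, and `in_ellipsoid` are hypothetical.

```python
import math


def mahalanobis_sq(y, mean, cov):
    """Squared Mahalanobis distance of a 2-D point under a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


def conformal_radius(calibration_scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the set is unbounded.
        return math.inf
    return sorted(calibration_scores)[k - 1]


def in_ellipsoid(y, mean, cov, radius):
    """True if y lies in the ellipsoid {y : score(y) <= radius}."""
    return mahalanobis_sq(y, mean, cov) <= radius


# Usage: calibrate on held-out scores, then test a new observation.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
radius = conformal_radius(scores, alpha=0.5)
covered = in_ellipsoid((2.0, 0.0), (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0)), radius)
```

Because the filter supplies a new mean and covariance at each step, the same calibrated radius gives an ellipsoid whose center and shape change over time. The exchangeability-based guarantee of plain split-conformal prediction does not hold automatically for dependent series, which is the gap the summarized coverage argument addresses.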