Training methods
arXiv cs.LG · 10d ago

REVES: Augmented Training for Test-Time Scaling

REVES introduces a two-stage iterative framework that enhances LLM reasoning through sequential revision and verification. It achieves +6.5 points over RL baselines and +4.0 points over standard multi-turn training on LiveCodeBench, using a 4B base model with fewer rollouts than large evolutionary systems. The method improves error correction and generalizes to out-of-distribution puzzles like n_queens and mini_sudoku.

arXiv cs.LG · 10d ago

Smoothness-Based Derandomization of PAC-Bayes Bounds

A new framework derandomizes PAC-Bayes bounds for smooth loss functions by analyzing the generalization gap of the Jensen gap class via Rademacher complexity. The resulting bounds for deterministic predictors involve flatness measures derived from Jacobians and Hessians of the score map, and are applied to linear models and smooth neural networks. A practical regularizer is proposed, computed using folded BatchNorm weights, and validated on CIFAR-10 with varying batch sizes.
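
For readers unfamiliar with "folded BatchNorm weights": at inference time, a BatchNorm layer that follows a linear layer can be absorbed into that layer's weights and bias. The sketch below shows only this standard folding identity for a small fully connected layer in plain Python. It is not the paper's regularizer, and the function names `fold_batchnorm` and `linear` are hypothetical.

```python
import math


def linear(weight, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output-channel BatchNorm statistics into a linear layer.

    BatchNorm computes gamma * (y - mean) / sqrt(var + eps) + beta on the
    linear output y, so each row of W and each bias entry is rescaled by
    gamma / sqrt(var + eps), and the bias is additionally shifted.
    """
    folded_w, folded_b = [], []
    for row, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([scale * w for w in row])
        folded_b.append(scale * (b - m) + bt)
    return folded_w, folded_b


# Illustrative values only (not taken from the paper).
W = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -0.5]
gamma, beta = [2.0, 0.5], [0.1, 0.2]
mean, var = [1.0, 2.0], [4.0, 0.25]

W_folded, b_folded = fold_batchnorm(W, b, gamma, beta, mean, var, eps=0.0)
```

Applying `linear(W_folded, b_folded, x)` should give the same output as the linear layer followed by BatchNorm, so any norm or flatness quantity computed on the folded weights reflects the function the network computes at inference time.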

arXiv cs.LG · 10d ago

Wasserstein Policy Learning for Distributional Outcomes

This paper introduces offline policy learning for distribution-valued outcomes, where rewards are derived from utility functionals applied to Wasserstein barycenters. It establishes statistical guarantees for inverse propensity weighting (IPW) and doubly robust (DR) estimators, proving finite-sample regret bounds with leading dependence $\widetilde{\mathcal{O}}(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N})$, and provides a minimax lower bound confirming that this rate is sharp.
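
As a rough illustration of the two estimator families named above, the sketch below computes IPW and DR estimates of a policy's value from logged data. It assumes scalar rewards (for example, a utility already applied to each outcome), known logging propensities, and a deterministic policy. The paper's distribution-valued setting and barycenter construction are not reproduced, and all function and variable names are hypothetical.

```python
def ipw_value(contexts, actions, rewards, propensities, policy):
    """IPW estimate: average of 1{policy(x) == a} * r / e over the logged data."""
    total = 0.0
    for x, a, r, e in zip(contexts, actions, rewards, propensities):
        if policy(x) == a:
            total += r / e
    return total / len(rewards)


def dr_value(contexts, actions, rewards, propensities, policy, outcome_model):
    """DR estimate: model prediction plus an IPW-weighted residual correction."""
    total = 0.0
    for x, a, r, e in zip(contexts, actions, rewards, propensities):
        target = policy(x)
        estimate = outcome_model(x, target)
        if target == a:
            # Correct the model's prediction using the observed reward.
            estimate += (r - outcome_model(x, a)) / e
        total += estimate
    return total / len(rewards)


# Toy logged data (illustrative only): binary context, binary action,
# uniform logging policy with propensity 0.5.
contexts = [0, 1, 0, 1]
actions = [0, 1, 1, 0]
rewards = [1.0, 2.0, 0.0, 1.0]
propensities = [0.5, 0.5, 0.5, 0.5]


def policy(x):
    return x  # hypothetical policy: play the action equal to the context


def zero_model(x, a):
    return 0.0  # with a zero outcome model, DR reduces to IPW


ipw = ipw_value(contexts, actions, rewards, propensities, policy)
dr = dr_value(contexts, actions, rewards, propensities, policy, zero_model)
```

The DR form stays consistent if either the propensities or the outcome model is correct, which is the usual motivation for preferring it over plain IPW.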

arXiv cs.LG · 10d ago

CAHP: Complementary Attention Head Pruning for Efficient Transformers

CAHP introduces a post-hoc framework that uses graph-theoretical clustering and information-theoretic measures to select complementary attention heads in Transformers. It determines how many heads to retain automatically, without a predefined sparsity target, by identifying a performance-degradation threshold that keeps the loss in model performance minimal. It outperforms baselines in high-compression scenarios by preserving functionally critical heads in intermediate layers.