Training methods — korshunov.ai

Rubric-Conditioned Self-Distillation Framework

Rubric-Conditioned Self-Distillation is a framework that uses structured rubrics to provide fine-grained, token-level feedback during self-distillation of reasoning language models. By conditioning teacher models on rubric-level criteria, it enables more precise credit assignment than scalar rewards, outperforming GRPO and OPSD by 1.0 and 0.9 points, respectively, on average across science reasoning benchmarks.

arXiv cs.AI · 10d ago

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based RL

UBP2 introduces a model-based method that actively explores environments by jointly reasoning over uncertainties in reward, dynamics, and value functions. It achieves superior sample efficiency in preference-based reinforcement learning, outperforming both model-free and non-optimistic model-based baselines on the Meta-World benchmark.

arXiv cs.CL · 10d ago

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE addresses policy entropy collapse in GRPO-based reinforcement learning by identifying entropy-critical token subsets via surprisal quantiles and reweighting their advantages. It maintains stable policy entropy across model scales and tasks, outperforming DAPO and other baselines by 4%-8% on AIME24 and AIME25, with consistent exploration-exploitation balance.

arXiv cs.CL · 10d ago

Large Language Gibbs for Structured Probabilistic Inference

Large Language Gibbs uses LLM conditional distributions as transition operators for iterative variable resampling. This method enables probabilistically coherent structured inference by avoiding order-dependent biases and achieving a stationary distribution that balances local conditionals. It demonstrates practical efficacy in synthetic distributions, consistent reasoning, and Bayesian structure learning.
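The resampling loop described above follows the shape of a standard Gibbs sampler, which can be sketched on a toy problem. In the sketch below, two small lookup functions stand in for LLM conditional distributions; these functions, the variable names, and the probabilities are all hypothetical and chosen only so that the result is easy to check.

```python
import random

# Hypothetical stand-ins for LLM conditionals: each function returns
# P(variable is True | current values of the other variables).
def p_a_given(state):
    return 0.8 if state["b"] else 0.2

def p_b_given(state):
    return 0.8 if state["a"] else 0.2

def gibbs(conditionals, init, sweeps, rng):
    """Resample each variable in turn from its conditional, for `sweeps` passes."""
    state = dict(init)
    samples = []
    for _ in range(sweeps):
        for name, cond in conditionals.items():
            state[name] = rng.random() < cond(state)
        samples.append(dict(state))
    return samples

rng = random.Random(0)
samples = gibbs({"a": p_a_given, "b": p_b_given},
                {"a": True, "b": False}, 20000, rng)

# Fraction of samples in which the two variables agree.
agree = sum(s["a"] == s["b"] for s in samples) / len(samples)
```

These toy conditionals are consistent with a joint distribution in which `a` and `b` agree with probability 0.8, so `agree` should settle near that value regardless of the initial state. With real LLM conditionals, consistency is not guaranteed, which is the setting the summary's "balances local conditionals" refers to.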

arXiv cs.CL · 10d ago

arXiv cs.LG · 10d ago

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS introduces a new approach that decouples experience retention from parametric updates by modeling orchestration as an evolvable Meta-Skill. It uses a closed-loop process involving multi-trajectory rollouts and selective reflection to distill reusable strategy principles, achieving strong performance gains and robust transferability across tasks and LLMs.

arXiv cs.LG · 10d ago

TAPO: Self-Distillation with Micro-Reflective Trajectories

TAPO advances self-distillation by constructing explicit micro-reflective trajectories that retain erroneous reasoning and insert natural-language diagnoses. These trajectories, derived from correct and incorrect model rollouts, provide fine-grained error corrections anchored in the model's own reasoning, improving both first-pass reasoning and error correction compared to GRPO.

arXiv cs.LG · 10d ago

REVES: Augmented Training for Test-Time Scaling

REVES introduces a two-stage iterative framework that enhances LLM reasoning through sequential revision and verification. It achieves +6.5 points over RL baselines and +4.0 points over standard multi-turn training on LiveCodeBench, using a 4B base model with fewer rollouts than large evolutionary systems. The method improves error correction and generalizes to out-of-distribution puzzles like n_queens and mini_sudoku.

arXiv cs.LG · 10d ago

Complexity Results for Binarized Neural Network Robustness Verification

The paper proves that satisfiability of binarized neural networks is NP-complete via a reduction from SAT. It also shows that uniform image occlusion leads to a piecewise-constant output structure, allowing polynomial-time robustness verification.

arXiv cs.LG · 10d ago

GrapNet: A Programmable Dynamic-Architecture Neural Graph Substrate

GrapNet introduces a programmable neural graph substrate where architecture edits are first-class operations. It outperforms dense MLPs on Split Fashion-MNIST and CIFAR-10, achieving 63.16% and 3.81% accuracy gains respectively, with statistically significant results.

arXiv cs.LG · 10d ago

Robust Sequential Conditional Independence Testing

A new sequential method combines adaptive betting with kernel statistics to test conditional independence, reducing the Type I error inflation caused by estimation error. It outperforms existing sequential Model-X approaches on both synthetic and real-world fairness tasks, maintaining high power while being more robust to distributional estimation errors.

arXiv cs.LG · 10d ago

FoMoE Breaks Full-Replica Barrier with Partitioned Expert Layers

FoMoE introduces a system that partitions expert layers across workers to avoid full model replicas, reducing communication costs by up to 1.42x over baselines and 45.44x over DDP. It achieves up to 1.4x throughput speedups via a skip-token mechanism and demonstrates stable routing, with projected benefits extending to 100B-scale models through system modeling.

arXiv cs.LG · 10d ago

Learnable Speech-to-Spike Encoder for Spiking Neural Networks

A learnable residual speech-to-spike encoder is jointly trained with a Recurrent Leaky Integrate-and-Fire network, achieving up to 94.97% accuracy on the Google Speech Commands v2 benchmark. A 35k-parameter version reaches 89.8%, outperforming prior methods with far fewer parameters, and shows task-aligned spike representations that improve class separability.
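For readers unfamiliar with the neuron model named above, here is a minimal discrete-time leaky integrate-and-fire (LIF) neuron. It shows only the basic leak, integrate, threshold, and reset dynamics; it is not the paper's recurrent network or its learnable encoder. The decay factor `beta`, the threshold, and the reset-by-subtraction rule are assumptions made for illustration.

```python
def lif_run(inputs, beta=0.5, threshold=1.0):
    """Simulate one LIF neuron over a sequence of input currents.

    Each step decays the membrane potential by `beta`, adds the input,
    emits a spike (1) if the potential reaches `threshold`, and then
    subtracts the threshold from the potential (soft reset).
    """
    v = 0.0
    spikes = []
    for current in inputs:
        v = beta * v + current
        if v >= threshold:
            spikes.append(1)
            v -= threshold
        else:
            spikes.append(0)
    return spikes

# A constant sub-threshold input accumulates until the neuron fires.
spike_train = lif_run([0.6, 0.6, 0.6])
```

In this sketch the input current would come from an encoder; the summary indicates that the paper learns that encoder jointly with the spiking network rather than fixing it by hand.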

arXiv cs.LG · 10d ago

RL Reward Types Enhance Resilience in Cyber-Physical Systems

A study evaluates model-free reinforcement learning controllers for nonlinear systems under cyberattacks. The Lyapunov reward offers the best resilience with low tracking error, while Proximal Policy Optimization outperforms Deep Deterministic Policy Gradient in reducing KPI variance.

arXiv cs.LG · 10d ago

Structure-First Architectures for Dynamical Learning

A new paradigm for dynamical systems learning prioritizes structural design over nonlinear expressivity. The proposed wave-inspired dynamical units use explicit, causal interactions to form layered architectures that exhibit emergent hierarchical behavior and informative internal representations, even with minimal parameter optimization.

arXiv cs.LG · 10d ago

Smoothness-Based Derandomization of PAC-Bayes Bounds

A new framework derandomizes PAC-Bayes bounds for smooth loss functions by analyzing the generalization gap of the Jensen gap class via Rademacher complexity. The resulting bounds for deterministic predictors involve flatness measures derived from Jacobians and Hessians of the score map, and are applied to linear models and smooth neural networks. A practical regularizer is proposed, computed using folded BatchNorm weights, and validated on CIFAR-10 with varying batch sizes.

arXiv cs.LG · 10d ago

Wasserstein Policy Learning for Distributional Outcomes

This paper introduces offline policy learning for distribution-valued outcomes, where rewards are derived from utility functionals applied to Wasserstein barycenters. It establishes statistical guarantees using IPW and DR estimators, proving finite-sample regret bounds with leading dependence $\widetilde{\mathcal{O}}\left(\sqrt{\mathrm{N\text{-}dim}(\Pi)/N}\right)$, and provides a minimax lower bound confirming the sharpness of this rate.

arXiv cs.LG · 10d ago

Pareto Q-Learning with Reward Machines

PQLRM is a multi-objective reinforcement learning algorithm that combines Pareto Q-Learning with Reward Machines to handle non-Markovian rewards. It converges faster than a naive PQL baseline on cross-product MDPs and generates Pareto-optimal policies beyond the capability of QRM.
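Pareto-based methods such as the one summarized above rely on comparing reward vectors by Pareto dominance and keeping only the non-dominated ones. The sketch below shows that filtering step in isolation, assuming every objective is to be maximized; the function names are illustrative, and the Reward Machine component is not modeled here.

```python
def dominates(u, v):
    """True if u is at least as good as v in every objective and better in one."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))

def non_dominated(vectors):
    """Return the Pareto front of a collection of reward vectors.

    Duplicates are removed first; the original order is preserved.
    """
    unique = list(dict.fromkeys(map(tuple, vectors)))
    return [u for u in unique
            if not any(dominates(v, u) for v in unique)]

# (2, 2) is dominated by (3, 3); the other vectors are mutually incomparable.
front = non_dominated([(1, 5), (3, 3), (2, 2), (5, 1), (3, 3)])
```

A set-valued Q-learning update would apply a filter like this to the union of candidate return vectors for each state-action pair, so that only trade-offs worth keeping are propagated.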

arXiv cs.LG · 10d ago

CAHP: Complementary Attention Head Pruning for Efficient Transformers

CAHP introduces a post-hoc framework that uses graph-theoretical clustering and information-theoretic measures to select complementary attention heads in Transformers. It automatically determines head retention without predefined sparsity, identifying a performance degradation threshold to ensure minimal model loss, and outperforms baselines in high-compression scenarios by preserving functionally critical heads in intermediate layers.

arXiv cs.CL · 10d ago

Distance-Adaptive Representation for Attention

A new attention mechanism, Distance-Adaptive Representation (DAR), assigns richer representations to nearby tokens and reduced dimensions to distant ones. This approach matches full-dimensional performance across multiple model scales and in fine-tuning, outperforming uniform dimensionality reduction.