Training methods — korshunov.ai

Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning

The paper introduces a framework for multi-policy multi-objective reinforcement learning that learns a set of Pareto-optimal policies ensuring fairness across diverse user preferences. It proves fair policies remain within the convex coverage set for concave welfare functions and proposes three algorithms that incorporate non-stationary and stochastic policy dynamics. Empirical results show these methods effectively learn fair policies adaptable to varying user preferences.

arXiv cs.LG · 8d ago

Ternary Mamba: Efficient QAT of SSMs from Pretrained Checkpoints

Ternary Mamba compresses Mamba-2 by 3.61x, from 2,687 MB to 744 MB, using grouped quantization-aware training (QAT) with knowledge distillation. It reaches 48.1% zero-shot accuracy on 7 tasks after 102M training tokens, within 0.9 percentage points of Bi-Mamba, while avoiding costly from-scratch training.

arXiv cs.LG · 8d ago
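
As a rough illustration of what grouped ternary quantization involves, the sketch below maps each group of weights to codes in {-1, 0, +1} with one shared scale per group. The absmean scale, the group size, and the function names `ternarize_group`, `ternarize`, and `dequantize` are assumptions made for this example; the summary does not specify the paper's quantizer, and the training side of QAT (for instance a straight-through gradient estimator and the distillation loss) is not shown.

```python
def ternarize_group(weights):
    """Quantize one group of weights to codes in {-1, 0, +1} plus a shared scale."""
    # Assumption: absmean scaling (mean absolute value of the group).
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clip to the ternary range.
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ternarize(weights, group_size):
    """Split a flat weight list into groups and quantize each group independently."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [ternarize_group(g) for g in groups]


def dequantize(quantized):
    """Reconstruct approximate weights from (codes, scale) pairs."""
    return [c * s for codes, s in quantized for c in codes]


# Size ratio implied by the figures quoted in the summary (MB before / MB after).
compression_ratio = 2687 / 744
```

Calling `ternarize([0.9, -1.1, 0.05, 1.0], group_size=4)` yields a single group whose small weight is zeroed and whose remaining weights keep only their sign. The ratio 2687 / 744 rounds to the 3.61x figure quoted above.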

LiL-Q: Convex Method for Nonlinear PDEs with PINNs

A new convex quasilinearization method, LiL-Q, solves nonlinear PDEs by reducing them to linear subproblems using physics-informed neural networks. LiL-Q converges in single-digit iterations across seven benchmarks, achieving machine precision when the exact solution lies in the trial space, and requires up to two orders of magnitude fewer parameters than standard PINN solvers.

arXiv cs.LG · 8d ago

Diffusion Approximation for TD Learning with Linear Features

A stochastic differential equation model is introduced for linear TD(0) learning under Markovian noise. It separates contraction dynamics from sampling effects and explains the error floor via interaction between long-run covariance and the projected Bellman operator's geometry.

arXiv cs.LG · 8d ago
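
For readers unfamiliar with the algorithm being modeled, here is a minimal sketch of linear TD(0) run on samples from a single Markov chain trajectory. The two-state chain, its rewards, the one-hot features, and the names `td0_update` and `run_td0` are invented for illustration and do not come from the paper. With a constant step size the iterates keep fluctuating around the fixed point instead of converging exactly, which is the kind of error floor the summary refers to.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def td0_update(theta, phi_s, reward, phi_next, alpha, gamma):
    """One linear TD(0) step: theta += alpha * delta * phi(s)."""
    # TD error: bootstrapped target minus the current value estimate.
    delta = reward + gamma * dot(phi_next, theta) - dot(phi_s, theta)
    return [t + alpha * delta * x for t, x in zip(theta, phi_s)]


def run_td0(steps, alpha=0.05, gamma=0.9, seed=0):
    """Run TD(0) along one trajectory of a toy two-state chain (hypothetical)."""
    rng = random.Random(seed)
    features = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # one-hot features
    rewards = {0: 0.0, 1: 1.0}                 # reward collected in each state
    stay_prob = 0.8                            # probability of staying put
    theta = [0.0, 0.0]
    state = 0
    for _ in range(steps):
        # Consecutive samples are correlated: this is the Markovian noise.
        next_state = state if rng.random() < stay_prob else 1 - state
        theta = td0_update(theta, features[state], rewards[state],
                           features[next_state], alpha, gamma)
        state = next_state
    return theta
```

In this toy chain the rewarding state ends up with the larger value estimate, and lowering `alpha` shrinks the residual fluctuation at the cost of slower convergence.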

Looped World Models Achieve 100x Parameter Efficiency

Looped World Models (LoopWM) introduce a looped architecture that iteratively refines latent environment states using a parameter-shared transformer. This approach achieves up to 100x parameter efficiency over conventional world models by adapting computation depth to each prediction step. LoopWM establishes iterative latent depth as a new scaling dimension for world simulation.

arXiv cs.CL · 8d ago

SkillWeaver: Compositional Skill Routing for LLM Agents

SkillWeaver introduces a decompose-retrieve-compose framework for LLM agents, formalizing the Compositional Skill Routing problem. It reaches 67.7% decomposition accuracy with Iterative Skill-Aware Decomposition (SAD), up from 51.0% (p < 10^-6), and reduces context window usage by over 99%.

arXiv cs.CL · 8d ago

ConSA: Learnable Sparsity Control in Hybrid Attention

ConSA introduces a framework that learns how to allocate full attention (FA) versus sliding-window attention (SWA) using L0 regularization and augmented Lagrangian constraints. It outperforms rule-based methods; the learned allocation places SWA in the bottom layers and concentrates FA in middle-layer blocks, a pattern consistent across model scales and sparsity levels.

arXiv cs.CL · 8d ago

d-OPSD: On-policy Self-distillation for Diffusion LLMs

d-OPSD is the first on-policy self-distillation framework designed for diffusion LLMs. It uses self-generated answers as suffix conditioning and step-level supervision, enabling efficient post-training with only about 10% of RLVR's optimization steps while outperforming RLVR and SFT baselines on four reasoning benchmarks.

arXiv cs.CL · 8d ago

ZPPO: Teacher in Prompts, Not Gradients

Zone of Proximal Policy Optimization (ZPPO) integrates teacher knowledge directly into prompts rather than policy gradients. It uses Binary and Negative Candidate-included Questions to surface student failure modes and amplifies learning through a prompt replay buffer, achieving superior performance on hard questions across student scales, especially at smaller model sizes.

arXiv cs.CL · 8d ago

Variable-Width Transformers Outperform Uniform Architectures

A new ×-shaped transformer architecture allocates varying layer widths, widening early and late layers while narrowing middle ones. It reduces average layer width, leading to 22% fewer FLOPs and 15% less KV cache memory, while outperforming uniform baselines on language modeling loss across 200M to 2B parameter models.

arXiv cs.LG · 8d ago

MGUP: Momentum-Gradient Alignment for Selective Optimization

MGUP introduces a selective update mechanism that applies larger step-sizes to a fixed proportion of parameters in stochastic optimization, while using smaller, non-zero step-sizes for the rest. It integrates seamlessly with optimizers like AdamW, Lion, and Muon, providing theoretical convergence guarantees for MGUP-AdamW and demonstrating superior or more stable performance in training large language models and MAE pretraining tasks.

arXiv cs.LG · 8d ago
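
The selective-update idea can be sketched on top of plain momentum SGD. Everything specific here is an assumption for illustration: the summary does not say how parameters are selected, so this example guesses from the title and ranks coordinates by the product of momentum and gradient, gives the top `fraction` the larger step size, and gives the rest the smaller, non-zero one. The function name `selective_step` and all default values are hypothetical, and this is not the paper's MGUP-AdamW.

```python
def selective_step(params, grads, momentum, beta=0.9,
                   lr_large=0.1, lr_small=0.01, fraction=0.5):
    """One momentum-SGD step with two step sizes chosen per coordinate."""
    # Exponential moving average of gradients (the momentum buffer).
    new_m = [beta * m + (1 - beta) * g for m, g in zip(momentum, grads)]
    # Assumed alignment score: positive when momentum and gradient agree in sign.
    scores = [m * g for m, g in zip(new_m, grads)]
    # A fixed proportion of coordinates receives the larger step size.
    k = max(1, int(fraction * len(params)))
    ranked = sorted(range(len(params)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:k])
    new_p = [
        p - (lr_large if i in selected else lr_small) * m
        for i, (p, m) in enumerate(zip(params, new_m))
    ]
    return new_p, new_m
```

Because the unselected coordinates still move with `lr_small`, no parameter is frozen outright, matching the "smaller, non-zero step-sizes for the rest" described above.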

ReLAR: Reinforcement-Guided Latent Refinement for Stable LLM Reasoning

ReLAR introduces a reinforcement-guided framework that iteratively refines hidden states to improve LLM reasoning stability. It uses learned depth and action controllers trained via policy gradients to adaptively determine refinement steps, achieving better accuracy and generation quality with lower inference overhead than explicit reasoning methods.

arXiv cs.LG · 8d ago

NMF with Topological Regularisation for Interpretable Bases

A new method integrates persistent homology into non-negative matrix factorisation to regularise the topology of basis functions. This approach enables spatially coherent image components, periodic time-series, and clique-like graph signals by using threshold-free topological scores as regularisers in the NMF objective.

arXiv cs.LG · 8d ago

CARLOS: Deep RL for Continuous-time Optimal Stopping

CARLOS uses an aggregate deep neural network to learn a joint space-time exercise boundary for optimal stopping problems. It progressively refines stopping decisions at finer time resolutions and employs adaptive sampling to focus training near the stopping boundary. Benchmark results show CARLOS outperforms existing Bermudan solvers, approaching the American upper bound with high efficiency.

arXiv cs.LG · 8d ago

Reversal Q-Learning: A New Off-Policy RL Algorithm

Reversal Q-Learning (RQL) is a new off-policy reinforcement learning algorithm that trains a flow policy using prior data. By modeling flow refinement steps as actions in an expanded Markov decision process and applying virtual on-policy trajectories via reversal, RQL enables effective offline learning without backpropagation through time. Experiments on 50 robotic tasks show RQL achieves the best average performance among state-of-the-art flow-based offline RL methods.

arXiv cs.LG · 8d ago

SCBoost: Reducing Learner Redundancy via Residual Orthogonalization

SCBoost introduces residual orthogonalization to eliminate learner redundancy in boosting. It uses Spectral Residual Projection and Covariance-Regularized Weighting to ensure each learner captures novel error components and reduces ensemble correlations. Theoretical analysis and experiments show improved accuracy and F1 scores on ten benchmark datasets.

arXiv cs.LG · 8d ago

Credit-in-Event: Re-Anchoring Event Credit in Dynamics Models

A new method called Credit-in-Event identifies and addresses temporal credit dilution in learned dynamics models. CREST, a label-free and training-free readout, re-anchors pooled representations by estimating transient event cores and applying event-versus-rest contrast, reducing out-of-distribution error across multiple systems and data types. Ablations confirm the improvement stems from event-core credit re-anchoring, not generic locality or stability priors.

arXiv cs.LG · 8d ago

SelFix: Root-Selecting Fixed-Point Inversion for Rectified Flows via Trajectory Straightness

SelFix improves fixed-point inversion by selecting solutions that produce straighter inverse trajectories, enhancing real-image reconstruction and source-preserving editing. Experiments on FLUX.1-dev and PIE-Bench show it outperforms prior baselines in both reconstruction quality and editing fidelity.

arXiv cs.LG · 8d ago

Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction

A new framework decomposes pre-hoc fine-tuning prediction risk into intrinsic limits and optimization variance. It proves a necessary lower bound on variance decay and introduces a budget-optimal probing strategy, validated on synthetic and real-world benchmarks across three distinct prediction regimes.