Training methods — korshunov.ai

Diffusion Approximation for TD Learning with Linear Features

A stochastic differential equation model is introduced for linear TD(0) learning under Markovian noise. It separates contraction dynamics from sampling effects and explains the error floor through the interaction between the long-run covariance and the geometry of the projected Bellman operator.
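
For readers unfamiliar with the setting, here is a minimal sketch of standard linear TD(0) with a constant step size on a small Markov chain. It is not the paper's SDE model; the chain `P`, rewards `r`, features `Phi`, step size `alpha`, and all function names are hypothetical choices made for illustration. The iterate fluctuates around the projected Bellman fixed point, which is the kind of error floor the summary refers to.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic transition matrix P (rows sum to 1)."""
    evals, evecs = np.linalg.eig(P.T)
    v = np.real(evecs[:, np.argmax(np.real(evals))])
    return v / v.sum()

def td_fixed_point(P, r, Phi, gamma):
    """Projected Bellman fixed point: solve A theta = b with
    A = Phi^T D (I - gamma P) Phi and b = Phi^T D r."""
    D = np.diag(stationary_distribution(P))
    A = Phi.T @ D @ (np.eye(len(r)) - gamma * P) @ Phi
    b = Phi.T @ D @ r
    return np.linalg.solve(A, b)

def td0_linear(P, r, Phi, gamma, alpha, steps, seed=0):
    """Linear TD(0) along a single Markovian trajectory, constant step size."""
    rng = np.random.default_rng(seed)
    n = len(r)
    theta = np.zeros(Phi.shape[1])
    s = 0
    for _ in range(steps):
        s_next = rng.choice(n, p=P[s])
        # TD error for the transition s -> s_next, reward depends on s only
        delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta = theta + alpha * delta * Phi[s]
        s = s_next
    return theta

# Hypothetical 3-state chain with two features per state (bias, state index).
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
r = np.array([0.0, 0.0, 1.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])
gamma = 0.9

theta_star = td_fixed_point(P, r, Phi, gamma)
theta = td0_linear(P, r, Phi, gamma, alpha=0.002, steps=100_000)
print("fixed point:", theta_star)
print("TD(0) iterate:", theta)
```

With a constant step size the iterate does not converge exactly; rerunning with a different `seed` or a larger `alpha` changes how far it sits from `theta_star`.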

arXiv cs.LG · 8d ago

Looped World Models Achieve 100x Parameter Efficiency

Looped World Models (LoopWM) introduce a looped architecture that iteratively refines latent environment states using a parameter-shared transformer. This approach achieves up to 100x parameter efficiency over conventional world models by adapting computation depth to each prediction step. LoopWM establishes iterative latent depth as a new scaling dimension for world simulation.

arXiv cs.CL · 8d ago

SkillWeaver: Compositional Skill Routing for LLM Agents

SkillWeaver introduces a decompose-retrieve-compose framework for LLM agents, formalizing the Compositional Skill Routing problem. Its Iterative Skill-Aware Decomposition (SAD) raises decomposition accuracy from 51.0% to 67.7% (p < 10^-6) and reduces context window usage by over 99%.

arXiv cs.CL · 8d ago

ConSA: Learnable Sparsity Control in Hybrid Attention

ConSA introduces a framework that learns optimal full vs. sliding-window attention allocation using L0 regularization and augmented Lagrangian constraints. It outperforms rule-based methods, with SWA placed in bottom layers and FA concentrated in middle-layer blocks, a pattern consistent across model scales and sparsity levels.

arXiv cs.CL · 8d ago

d-OPSD: On-policy Self-distillation for Diffusion LLMs

d-OPSD is the first on-policy self-distillation framework designed for diffusion LLMs. It uses self-generated answers as suffix conditioning and step-level supervision, enabling efficient post-training with only about 10% of RLVR's optimization steps while outperforming RLVR and SFT baselines on four reasoning benchmarks.

arXiv cs.CL · 8d ago

ZPPO: Teacher in Prompts, Not Gradients

Zone of Proximal Policy Optimization (ZPPO) integrates teacher knowledge directly into prompts rather than policy gradients. It uses Binary and Negative Candidate-included Questions to surface student failure modes and amplifies learning through a prompt replay buffer, achieving superior performance on hard questions across student scales, especially at smaller model sizes.

arXiv cs.CL · 8d ago

Variable-Width Transformers Outperform Uniform Architectures

A new ×-shaped transformer architecture allocates varying layer widths, widening early and late layers while narrowing middle ones. It reduces average layer width, leading to 22% fewer FLOPs and 15% less KV cache memory, while outperforming uniform baselines on language modeling loss across 200M to 2B parameter models.

arXiv cs.LG · 8d ago

MGUP: Momentum-Gradient Alignment for Selective Optimization

MGUP introduces a selective update mechanism that applies larger step-sizes to a fixed proportion of parameters in stochastic optimization, while using smaller, non-zero step-sizes for the rest. It integrates seamlessly with optimizers like AdamW, Lion, and Muon, providing theoretical convergence guarantees for MGUP-AdamW and demonstrating superior or more stable performance in training large language models and MAE pretraining tasks.

arXiv cs.LG · 8d ago

ReLAR: Reinforcement-Guided Latent Refinement for Stable LLM Reasoning

ReLAR introduces a reinforcement-guided framework that iteratively refines hidden states to improve LLM reasoning stability. It uses learned depth and action controllers trained via policy gradients to adaptively determine refinement steps, achieving better accuracy and generation quality with lower inference overhead than explicit reasoning methods.

arXiv cs.LG · 8d ago

NMF with Topological Regularisation for Interpretable Bases

A new method integrates persistent homology into non-negative matrix factorisation to regularise the topology of basis functions. This approach enables spatially coherent image components, periodic time-series, and clique-like graph signals by using threshold-free topological scores as regularisers in the NMF objective.

arXiv cs.LG · 8d ago

CARLOS: Deep RL for Continuous-time Optimal Stopping

CARLOS uses an aggregate deep neural network to learn a joint space-time exercise boundary for optimal stopping problems. It progressively refines stopping decisions at finer time resolutions and employs adaptive sampling to focus training near the stopping boundary. Benchmark results show that CARLOS outperforms existing Bermudan solvers and approaches the American upper bound with high efficiency.

arXiv cs.LG · 8d ago

Reversal Q-Learning: A New Off-Policy RL Algorithm

Reversal Q-Learning (RQL) is a new off-policy reinforcement learning algorithm that trains a flow policy using prior data. By modeling flow refinement steps as actions in an expanded Markov decision process and applying virtual on-policy trajectories via reversal, RQL enables effective offline learning without backpropagation through time. Experiments on 50 robotic tasks show RQL achieves the best average performance among state-of-the-art flow-based offline RL methods.

arXiv cs.LG · 8d ago

SCBoost: Reducing Learner Redundancy via Residual Orthogonalization

SCBoost introduces residual orthogonalization to eliminate learner redundancy in boosting. It uses Spectral Residual Projection and Covariance-Regularized Weighting to ensure each learner captures novel error components and reduces ensemble correlations. Theoretical analysis and experiments show improved accuracy and F1 scores on ten benchmark datasets.

arXiv cs.LG · 8d ago

Credit-in-Event: Re-Anchoring Event Credit in Dynamics Models

A new method called Credit-in-Event identifies and addresses temporal credit dilution in learned dynamics models. CREST, a label-free and training-free readout, re-anchors pooled representations by estimating transient event cores and applying event-versus-rest contrast, reducing out-of-distribution error across multiple systems and data types. Ablations confirm the improvement stems from event-core credit re-anchoring, not generic locality or stability priors.

arXiv cs.LG · 8d ago

SelFix: Root-Selecting Fixed-Point Inversion for Rectified Flows via Trajectory Straightness

SelFix improves fixed-point inversion by selecting solutions that produce straighter inverse trajectories, enhancing real-image reconstruction and source-preserving editing. Experiments on FLUX.1-dev and PIE-Bench show it outperforms prior baselines in both reconstruction quality and editing fidelity.

arXiv cs.LG · 8d ago

Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction

A new framework decomposes pre-hoc fine-tuning prediction risk into intrinsic limits and optimization variance. It proves a necessary lower bound on variance decay and introduces a budget-optimal probing strategy, validated on synthetic and real-world benchmarks across three distinct prediction regimes.

arXiv cs.LG · 8d ago

Learnable Graph Patches for Feature Heterogeneity

Learnable graph patches are proposed as the smallest semantic units in graph data, addressing feature heterogeneity without textual information. The framework uses patch encoders and aggregators to extract and combine knowledge across domains, enabling universal pre-training and downstream performance that improves as pre-training data grows.

arXiv cs.LG · 8d ago

EnvRL: Leveraging Environment Dynamics in Agentic RL

EnvRL introduces a framework that enhances agentic reinforcement learning by incorporating environment dynamics through state prediction and inverse dynamics objectives. When trained with GRPO, EnvRL improves success rates of Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.

arXiv cs.LG · 8d ago

Confusion-Aware Transfer Teacher Curriculum Learning Framework

A confusion-aware difficulty score is introduced within the Transfer Teacher framework to improve model interpretability and data efficiency. Evaluations on CIFAR-10 show that confusion-aware curriculum ordering outperforms random ordering by up to 8.7% when training on 20% of the data, demonstrating consistent data-efficiency gains. However, neither curriculum nor anti-curriculum ordering improves accuracy over standard training on the full dataset, indicating that scoring function improvements alone are insufficient to overcome curriculum learning failure modes.