Training methods — korshunov.ai


RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

RODS addresses sample depletion in multi-turn tool-use RL by using reward variance to detect capability boundaries. It synthesizes new data in real time, matching the structural complexity of boundary samples, and maintains a dynamic replay buffer that co-evolves with the policy. RODS performs comparably to a 17K-sample offline pipeline while using 20x fewer trajectories.

arxiv arXiv cs.AI · 8d ago
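
The reward-variance signal in the RODS summary can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `boundary_prompts`, the `min_variance` threshold, and the toy rewards are all assumptions made for illustration. The idea shown is only that prompts whose rollouts sometimes succeed and sometimes fail have high reward variance, while prompts that are always solved or never solved have none.

```python
from statistics import pvariance


def boundary_prompts(rewards_by_prompt, min_variance=0.05):
    """Return prompt ids ordered by reward variance, highest first.

    Hypothetical helper: a prompt counts as a "boundary" sample when the
    rewards of its rollouts vary by at least `min_variance` (an assumed
    threshold, not a value from the paper).
    """
    scored = []
    for prompt_id, rewards in rewards_by_prompt.items():
        if len(rewards) < 2:
            continue  # variance is uninformative for a single rollout
        variance = pvariance(rewards)
        if variance >= min_variance:
            scored.append((variance, prompt_id))
    scored.sort(reverse=True)
    return [prompt_id for _, prompt_id in scored]


# Toy rollout rewards (made up): 1.0 = task solved, 0.0 = task failed.
rollouts = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],
    "never_solved": [0.0, 0.0, 0.0, 0.0],
    "boundary": [1.0, 0.0, 1.0, 0.0],
}
selected = boundary_prompts(rollouts)
```

Only the mixed-outcome prompt is selected here; how RODS turns such samples into newly synthesized data is not covered by this sketch.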

Pareto Q-Learning with Reward Machines

PQLRM is a multi-objective reinforcement learning algorithm that combines Pareto Q-Learning with Reward Machines to handle non-Markovian rewards. It converges faster than naive PQL on cross-product MDPs and generates Pareto-optimal policies beyond the capability of QRM.

media r/LocalLLaMA · 8d ago
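
Pareto Q-Learning, which PQLRM builds on, works with sets of non-dominated value vectors rather than scalar Q-values. The sketch below shows only that dominance filter, assuming every objective is maximized; the helper names and the candidate vectors are illustrative, and the Reward Machine component is not modeled.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(vectors):
    """Keep the vectors that no other vector dominates (maximization)."""
    return [v for v in vectors if not any(dominates(w, v) for w in vectors)]


# Made-up two-objective return vectors for a single state-action pair.
candidates = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (0.5, 0.5)]
front = pareto_front(candidates)
```

In this toy set, `(1.5, 3.0)` and `(0.5, 0.5)` are dropped because `(2.0, 4.0)` is better on both objectives; the three remaining vectors are mutually incomparable trade-offs.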

LoopCoder-V2: Two-Loop PLT Model Achieves Best Gain-Cost Trade-Off

LoopCoder-V2 is a 7B instruction-tuned code model based on the Parallel Loop Transformer (PLT), trained on 18T tokens of mixed text and code data. The two-loop variant achieves the best gain-cost balance, improving SWE-bench Verified from 43.0 to 64.4, while three or more loops cause regressions attributed to increasing positional mismatch and unstable updates.

arxiv arXiv cs.LG · 8d ago

Recursive Masked Diffusion Models Introduce New Scaling Axis

Recursive Masked Diffusion Models (R-MDMs) introduce recursive depth as a third scaling axis by reapplying a denoising transformer within each diffusion step. This recursion enables iterative output refinement without increasing parameter count, achieving performance comparable to non-recursive models with up to L times more parameters, where L is the number of iterations. R-MDMs also reduce inference compute by partially replacing denoising steps with recursive refinement.

arxiv arXiv cs.LG · 8d ago

Catastrophic Forgetting is Low-Rank: A Function-Space Theory

A function-space theory reveals that catastrophic forgetting in continual adaptation concentrates in a small number of old-task NTK eigenmodes. In frozen-backbone linear-head PEFT-CL, the forgetting vector is exactly predictable up to numerical precision, with a Kronecker scaling rule for the vulnerable rank.

arxiv arXiv cs.LG · 8d ago

INI-VPINN: Physics-Informed Neural Network with Implicit Boundary Handling

INI-VPINN is a variational physics-informed neural network that implicitly enforces Neumann and interface conditions using compactly supported weighting functions and integration by parts. It achieves higher accuracy and faster convergence than existing PINN methods on multi-material problems with geometric singularities and mixed boundary conditions, and its code is publicly available on GitHub.

arxiv arXiv cs.LG · 8d ago

Volterra Generative Models Introduce Fractional Noise for Score-Based Generation

Volterra generative models propose a continuous-time score-based framework that uses fractional kernels to inject path-dependent noise, avoiding the memoryless noising of traditional diffusion models. The approach introduces finite-dimensional Markovian lifts and proves squared error bounds, demonstrating improved generation on MNIST and potential for natural images, with a bridge sampler enhancing stability for larger models.

arxiv arXiv cs.LG · 8d ago

Tensor-based Second-order Causal Discovery Algorithm

TSCD uses covariance matrices from observational and interventional data to identify causal structures in linear structural equation models on DAGs. It requires only uncorrelated noise and identifies causal orders and parameters using a logarithmic number of interventions, scaling to hundreds of variables while remaining robust to noise and competitive with existing methods.

arxiv arXiv cs.LG · 8d ago

Edge Flow: A Continuous-Time Model for Gradient Descent at Edge of Stability

Edge Flow is a tractable, predictive continuous-time model that captures gradient descent dynamics at the edge of stability. It decomposes dynamics into center, oscillation direction, and magnitude, with self-stabilization of sharpness emerging from coupled feedback. The model requires only two gradient evaluations and one Hessian-vector product per iteration and outperforms prior models in tracking oscillations and explaining instabilities at EoS.

arxiv arXiv cs.LG · 8d ago
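
For context on what "edge of stability" refers to in the Edge Flow summary, the sketch below shows the classical stability threshold of gradient descent on a one-dimensional quadratic: iterates shrink when the curvature (sharpness) is below 2 / learning rate and grow when it is above. This is a textbook illustration only, not the Edge Flow model; the function name and the numeric values are assumptions.

```python
def gd_on_quadratic(sharpness, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return the final iterate."""
    x = x0
    for _ in range(steps):
        x -= lr * sharpness * x  # gradient of f at x is sharpness * x
    return x


lr = 0.1  # stability threshold for the sharpness is 2 / lr = 20
below = abs(gd_on_quadratic(19.0, lr))  # per-step factor is about -0.9: oscillates and shrinks
above = abs(gd_on_quadratic(21.0, lr))  # per-step factor is about -1.1: oscillates and grows
```

Each step multiplies `x` by `1 - lr * sharpness`, so the sign flips every iteration in both runs; what changes at the threshold is whether the oscillation decays or blows up.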

Compositional Generalization in Language Model Reasoning

A hierarchical latent selection model shows that supervised fine-tuning and reinforcement learning work together to enable compositional generalization in language models. SFT provides raw module materials, while RL identifies and recombines atomic modules from compound traces to solve new problems. Training on compound traces leads to stronger generalization than isolated module training, and an effective protocol is found where SFT ensures module coverage and RL drives exploration of novel compositions.

arxiv arXiv cs.LG · 8d ago

S4oP: Operator-level Pruning for Efficient SSM Deployment

S4oP introduces an incremental, operator-level pruning method for S4 and S4D models, reducing inference cost by up to 70% while maintaining predictive performance. The approach combines structured masking with fine-tuning and jointly tracks accuracy and latency, enabling efficient deployment of SSMs on resource-constrained devices.

arxiv arXiv cs.LG · 8d ago

Deep Reinforcement Learning for Minimum Zero-Forcing Sets

This paper proposes SD-ZFS, a deep reinforcement learning framework adapted from S2V-DQN, to solve the NP-hard minimum zero-forcing set problem on undirected graphs. The framework demonstrates strong performance compared to optimal solutions and greedy heuristics, showing effective generalization, scalability, and transfer across diverse graph structures.

arxiv arXiv cs.LG · 8d ago
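
The zero-forcing problem behind SD-ZFS rests on a simple color-change rule: a colored vertex with exactly one uncolored neighbor forces that neighbor to become colored, and a starting set is a zero-forcing set if repeated application colors the whole graph. The sketch below checks that property for a given set; it does not search for a minimum set and is unrelated to the paper's learned policy. The function name and the example graphs are illustrative.

```python
def is_zero_forcing_set(adjacency, initial):
    """Apply the color-change rule until nothing changes; True if every vertex ends up colored.

    `adjacency` maps each vertex to a list of its neighbors (undirected graph).
    """
    colored = set(initial)
    changed = True
    while changed:
        changed = False
        for v in list(colored):
            uncolored = [u for u in adjacency[v] if u not in colored]
            if len(uncolored) == 1:  # v forces its only uncolored neighbor
                colored.add(uncolored[0])
                changed = True
    return len(colored) == len(adjacency)


# Path 0-1-2-3: one endpoint forces the whole path, an interior vertex does not.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
# Star with center 0 and leaves 1, 2, 3: the center alone cannot force anything.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

path_from_end = is_zero_forcing_set(path, {0})
path_from_middle = is_zero_forcing_set(path, {1})
star_from_center = is_zero_forcing_set(star, {0})
star_from_two_leaves = is_zero_forcing_set(star, {1, 2})
```

Verifying a candidate set is cheap, as above; the hard part the paper targets is finding a smallest such set.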

Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning

The paper introduces a framework for multi-policy multi-objective reinforcement learning that learns a set of Pareto-optimal policies ensuring fairness across diverse user preferences. It proves fair policies remain within the convex coverage set for concave welfare functions and proposes three algorithms that incorporate non-stationary and stochastic policy dynamics. Empirical results show these methods effectively learn fair policies adaptable to varying user preferences.

arxiv arXiv cs.LG · 8d ago

Ternary Mamba: Efficient QAT of SSMs from Pretrained Checkpoints

Ternary Mamba achieves 3.61x compression of Mamba-2, from 2,687 MB to 744 MB, using grouped quantization-aware training with knowledge distillation. It reaches 48.1% zero-shot accuracy on seven tasks with 102M tokens of training, matching Bi-Mamba within 0.9 percentage points, while avoiding costly from-scratch training.

arxiv arXiv cs.LG · 8d ago
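
As a rough illustration of grouped ternary quantization, the sketch below maps each group of weights to codes in {-1, 0, +1} with one scale per group, and checks the compression ratio quoted in the Ternary Mamba summary. The absolute-mean scaling rule is an assumption borrowed from common ternary schemes and may differ from the paper's; the function names, group size, and weights are made up, and neither quantization-aware training nor distillation is shown.

```python
def ternarize_group(weights):
    """Quantize one group to codes in {-1, 0, 1} plus a single scale (assumed absmean rule)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def ternarize(weights, group_size):
    """Split a flat weight list into groups and quantize each group independently."""
    return [
        ternarize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


weights = [0.9, -1.1, 0.05, 1.0, -0.2, 0.2, 0.0, -0.4]  # made-up values
groups = ternarize(weights, group_size=4)

# Compression ratio implied by the sizes in the summary (MB before / MB after).
ratio = 2687 / 744
```

Per-group scales let a group of small weights keep its own resolution instead of being zeroed out by a scale dominated by larger weights elsewhere.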

LiL-Q: Convex Method for Nonlinear PDEs with PINNs

A new convex quasilinearization method, LiL-Q, solves nonlinear PDEs by reducing them to linear subproblems using physics-informed neural networks. LiL-Q converges in single-digit iterations across seven benchmarks, achieving machine precision when the exact solution lies in the trial space, and requires up to two orders of magnitude fewer parameters than standard PINN solvers.

arxiv arXiv cs.LG · 8d ago

Diffusion Approximation for TD Learning with Linear Features

A stochastic differential equation model is introduced for linear TD(0) learning under Markovian noise. It separates contraction dynamics from sampling effects and explains the error floor via interaction between long-run covariance and the projected Bellman operator's geometry.

arxiv arXiv cs.LG · 8d ago
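
The object of study in the TD-learning summary is the linear TD(0) update, which can be sketched as follows. The two-state chain, one-hot features, step size, and discount are assumptions chosen so the fixed point is easy to verify by hand; because the toy chain is deterministic, it shows only the contraction toward the fixed point and none of the sampling effects the paper analyzes.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def td0_update(theta, phi_s, phi_next, reward, alpha, gamma):
    """One linear TD(0) step: theta += alpha * td_error * phi(s)."""
    td_error = reward + gamma * dot(theta, phi_next) - dot(theta, phi_s)
    return [t + alpha * td_error * f for t, f in zip(theta, phi_s)]


# Toy chain (assumed): state 0 -> state 1 -> state 0 -> ...
# Leaving state 0 pays reward 1, leaving state 1 pays reward 0.
features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
rewards = {0: 1.0, 1: 0.0}
gamma, alpha = 0.5, 0.1

theta = [0.0, 0.0]
state = 0
for _ in range(2000):
    next_state = 1 - state
    theta = td0_update(
        theta, features[state], features[next_state], rewards[state], alpha, gamma
    )
    state = next_state
```

Solving V0 = 1 + 0.5 * V1 and V1 = 0.5 * V0 by hand gives V0 = 4/3 and V1 = 2/3, which is where `theta` settles with one-hot features.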

Looped World Models Achieve 100x Parameter Efficiency

Looped World Models (LoopWM) introduce a looped architecture that iteratively refines latent environment states using a parameter-shared transformer. This approach achieves up to 100x parameter efficiency over conventional world models by adapting computation depth to each prediction step. LoopWM establishes iterative latent depth as a new scaling dimension for world simulation.

arxiv arXiv cs.CL · 8d ago

SkillWeaver: Compositional Skill Routing for LLM Agents

SkillWeaver introduces a decompose-retrieve-compose framework for LLM agents, formalizing the Compositional Skill Routing problem. Its Iterative Skill-Aware Decomposition (SAD) raises decomposition accuracy from 51.0% to 67.7% (p < 10^-6) and reduces context window usage by over 99%.

arxiv arXiv cs.CL · 8d ago

ConSA: Learnable Sparsity Control in Hybrid Attention

ConSA introduces a framework that learns optimal full vs. sliding-window attention allocation using L0 regularization and augmented Lagrangian constraints. It outperforms rule-based methods, with SWA placed in bottom layers and FA concentrated in middle-layer blocks, a pattern consistent across model scales and sparsity levels.

arxiv arXiv cs.CL · 8d ago
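
The two attention types that ConSA allocates between can be contrasted with a small mask sketch, assuming causal attention. It shows only why sliding-window attention (SWA) is cheaper than full attention (FA); the learned allocation with L0 regularization is not modeled, and the function names, sequence length, and window size are illustrative.

```python
def causal_mask(n):
    """Full causal attention: position i may attend to every position j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Causal sliding-window attention: position i attends to its last `window` positions."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]


def attended_pairs(mask):
    """Count query-key pairs that are allowed to attend (a proxy for attention cost)."""
    return sum(sum(row) for row in mask)


n, window = 6, 2
full_cost = attended_pairs(causal_mask(n))
swa_cost = attended_pairs(sliding_window_mask(n, window))
```

Full causal attention grows quadratically with sequence length, while the windowed variant grows linearly for a fixed window, which is the trade-off any per-layer allocation has to balance against quality.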

d-OPSD: On-policy Self-distillation for Diffusion LLMs

d-OPSD is the first on-policy self-distillation framework designed for diffusion LLMs. It uses self-generated answers as suffix conditioning and step-level supervision, enabling efficient post-training with only about 10% of RLVR's optimization steps while outperforming RLVR and SFT baselines on four reasoning benchmarks.