Training methods — korshunov.ai

Self-Conditioned Credit Assignment for RL with Verifiable Rewards

SC-GRPO weights gradients in reinforcement learning by the per-token KL divergence from self-conditioned trajectories. It is reported to outperform GRPO by 8.1% and DAPO by 5.9% across math, code, and agentic tasks, with stronger out-of-distribution performance and better results than OPD.

arxiv arXiv cs.AI · 7d ago

Rescaling MLM-Head for Neural Sparse Retrieval

A study finds that large MLM-head norms in pretrained encoders degrade sparse retrieval performance in SPLADE. Introducing a simple initialization-time rescaling of the MLM-head stabilizes training and improves performance, matching or exceeding BERT-SPLADE in multiple benchmarks.

arxiv arXiv cs.AI · 7d ago
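
The summary above does not state the exact rescaling rule, so the following is only a minimal sketch of what an initialization-time rescaling could look like: the head's weight matrix is scaled once so that its Frobenius norm equals a chosen target. The function names, the use of the Frobenius norm, and the target value are assumptions for illustration, not the paper's method.

```python
import math

def frobenius_norm(weight):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(w * w for row in weight for w in row))

def rescale_to_norm(weight, target_norm):
    """Return a copy of `weight` scaled so its Frobenius norm equals `target_norm`.

    Hypothetical helper: applied once, before training starts.
    An all-zero matrix is returned unchanged because it cannot be rescaled.
    """
    norm = frobenius_norm(weight)
    if norm == 0.0:
        return [row[:] for row in weight]
    scale = target_norm / norm
    return [[w * scale for w in row] for row in weight]

# Toy "head" with a large norm, rescaled to unit norm.
head = [[3.0, 4.0], [0.0, 0.0]]
rescaled = rescale_to_norm(head, target_norm=1.0)
print(frobenius_norm(head), frobenius_norm(rescaled))
```

Because the scaling is a single multiplication at initialization, it leaves the direction of every weight row unchanged and only alters the overall magnitude.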

Reinforcement Learning Foundation Models Should Already Be A Thing

Reinforcement learning still lacks foundation models, even though training on synthetic MDPs is feasible. A proof-of-concept shows that a single model trained on synthetic MDPs solves tabular benchmarks without tuning, outperforming existing methods in online settings and matching them offline.

arxiv arXiv cs.AI · 7d ago

Maturing Markov Decision Processes Introduce New Decision Framework

Maturing Markov Decision Processes (MMDPs) model the asymmetric evolution of information and action availability in sequential decisions. They introduce an expiring-action priority principle and a structure-aware reinforcement learning framework that improves learning efficiency, especially in complex and scalable decision problems.

arxiv arXiv cs.AI · 7d ago

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

The paper argues that intelligence is embedded in the space itself: a scene induces a Riemannian metric on the configuration manifold. A single Encoder-Router network generates this metric by semigroup superposition, enabling zero-shot generalization to unseen obstacle configurations, with large cost differences between collision-free and obstacle-penetrating paths.

arxiv arXiv cs.AI · 7d ago

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS introduces a new approach that decouples experience retention from parametric updates by modeling orchestration as an evolvable Meta-Skill. It uses a closed-loop process involving multi-trajectory rollouts and selective reflection to distill reusable strategy principles, achieving strong performance gains and robust transferability across tasks and LLMs.

arxiv arXiv cs.AI · 7d ago

Spotlight: Using Spot GPUs to Accelerate DiT RL Post-Training

Spotlight enables DiT RL post-training by leveraging idle spot GPUs, reducing costs by 1.4-6.4× while achieving superior image quality. It uses stale model weights in exploration and reconfigures sequence parallelism in real time, allowing efficient GPU utilization without breaking training pipelines.

arxiv arXiv cs.AI · 7d ago

FoMoE Breaks Full-Replica Barrier with Partitioned Expert Layers

FoMoE introduces a system that partitions expert layers across workers to avoid full model replicas, reducing communication costs by up to 1.42x over efficient baselines and 45.44x over DDP. It achieves up to 1.4x throughput speedups via a skip-token mechanism and demonstrates stable routing, with projected benefits extending to 100B-scale models through system modeling.

arxiv arXiv cs.AI · 7d ago

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

RODS addresses sample depletion in multi-turn tool-use RL by using reward variance to detect capability boundaries. It synthesizes new data in real time, matching structural complexity of boundary samples, and maintains a dynamic replay buffer that co-evolves with the policy. RODS achieves performance comparable to a 17K-sample offline pipeline with 20x fewer trajectories.

arxiv arXiv cs.AI · 7d ago
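
The reward-variance signal in the RODS summary can be illustrated with a minimal sketch. It assumes several rollouts per prompt with binary rewards and treats prompts whose reward variance exceeds a threshold as boundary samples: prompts that always succeed or always fail have zero variance and carry no learning signal. The function name, the threshold, and the toy data are hypothetical and not taken from the paper.

```python
from statistics import pvariance

def boundary_prompts(rollout_rewards, min_variance=0.1):
    """Return prompt ids whose per-prompt reward variance is at least `min_variance`.

    `rollout_rewards` maps a prompt id to the list of rewards from its rollouts.
    The threshold is an illustrative assumption.
    """
    return [
        prompt
        for prompt, rewards in rollout_rewards.items()
        if pvariance(rewards) >= min_variance
    ]

# Toy data: four rollouts per prompt, reward 1 for success and 0 for failure.
rollouts = {
    "always_solved": [1, 1, 1, 1],
    "never_solved": [0, 0, 0, 0],
    "sometimes_solved": [1, 0, 1, 0],
}
print(boundary_prompts(rollouts))
```

In this toy setup only the mixed-outcome prompt is selected, which is the kind of sample a synthesis step could then try to match in structural complexity.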

Pareto Q-Learning with Reward Machines

PQLRM is a multi-objective reinforcement learning algorithm that combines Pareto Q-Learning with Reward Machines to handle non-Markovian rewards. It converges faster than naive PQL on cross-product MDPs and generates Pareto-optimal policies beyond the capability of QRM.

media r/LocalLLaMA · 8d ago
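
As background for the Pareto-optimal policies mentioned above, the sketch below shows Pareto dominance and a non-dominated filter over reward vectors, assuming every objective is maximized. It illustrates the general concept only; it is not the PQLRM algorithm, and the function names are chosen here for illustration.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on every objective
    and strictly better on at least one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(vectors):
    """Keep the vectors that no other vector in the list dominates."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors)]

# Two objectives per candidate policy, both to be maximized.
returns = [(1, 5), (3, 3), (2, 2), (5, 1)]
print(pareto_front(returns))
```

Here (2, 2) is dropped because (3, 3) is better on both objectives, while the remaining vectors each win on at least one objective and therefore stay on the front.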

LoopCoder-V2: Two-Loop PLT Model Achieves Best Gain-Cost Trade-Off

LoopCoder-V2 is a 7B instruction-tuned code model based on Parallel Loop Transformer (PLT), trained on 18T tokens of mixed text and code data. The two-loop variant achieves the best gain-cost balance, improving SWE-bench Verified from 43.0 to 64.4, while three or more loops result in regression due to increasing positional mismatch and unstable updates.

arxiv arXiv cs.LG · 8d ago

Recursive Masked Diffusion Models Introduce New Scaling Axis

Recursive Masked Diffusion Models (R-MDMs) introduce recursive depth as a third scaling axis by reapplying a denoising transformer within each diffusion step. This recursion enables iterative output refinement without increasing parameter count, achieving performance comparable to non-recursive models with up to L times more parameters, where L is the number of iterations. R-MDMs also reduce inference compute by partially replacing denoising steps with recursive refinement.

arxiv arXiv cs.LG · 8d ago

Catastrophic Forgetting is Low-Rank: A Function-Space Theory

A function-space theory reveals that catastrophic forgetting in continual adaptation concentrates in a small number of old-task NTK eigenmodes. In frozen-backbone linear-head PEFT-CL, the forgetting vector is exactly predictable up to numerical precision, with a Kronecker scaling rule for the vulnerable rank.

arxiv arXiv cs.LG · 8d ago

INI-VPINN: Physics-Informed Neural Network with Implicit Boundary Handling

INI-VPINN is a variational physics-informed neural network that implicitly enforces Neumann and interface conditions using compact support weighting functions and integration by parts. It achieves higher accuracy and faster convergence than existing PINN methods in solving multi-material problems with geometric singularities and mixed boundary conditions, and is publicly available on GitHub.

arxiv arXiv cs.LG · 8d ago

Volterra Generative Models Introduce Fractional Noise for Score-Based Generation

Volterra generative models propose a continuous-time score-based framework using fractional kernels to inject path-dependent noise, avoiding memoryless noising in traditional diffusion models. The approach introduces finite-dimensional Markovian lifts and proves squared error bounds, demonstrating improved generation on MNIST and potential for natural images, with a bridge sampler enhancing stability for larger models.

arxiv arXiv cs.LG · 8d ago

Tensor-based Second-order Causal Discovery Algorithm

TSCD uses covariance matrices from observational and interventional data to identify causal structures in linear structural equation models on DAGs. It requires only uncorrelated noise and identifies causal orders and parameters with a logarithmic number of interventions, scaling to hundreds of variables while remaining robust to noise and competitive with existing methods.

arxiv arXiv cs.LG · 8d ago

Edge Flow: A Continuous-Time Model for Gradient Descent at Edge of Stability

Edge Flow is a tractable, predictive continuous-time model that captures gradient descent dynamics at the edge of stability. It decomposes dynamics into center, oscillation direction, and magnitude, with self-stabilization of sharpness emerging from coupled feedback. The model requires only two gradient evaluations and one Hessian-vector product per iteration and outperforms prior models in tracking oscillations and explaining instabilities at EoS.

arxiv arXiv cs.LG · 8d ago
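
As background for the edge-of-stability regime (not the Edge Flow model itself), the sketch below runs plain gradient descent on a one-dimensional quadratic f(x) = 0.5 * s * x^2, where s is the sharpness. Each step multiplies x by (1 - lr * s), so iterates shrink when s < 2 / lr and grow when s > 2 / lr. The function name and the chosen values are illustrative assumptions.

```python
def gd_on_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = sharpness * x  # derivative of 0.5 * sharpness * x**2
        x -= lr * grad
    return x

lr = 0.1
threshold = 2.0 / lr  # stability threshold for this step size
below = gd_on_quadratic(sharpness=threshold - 1.0, lr=lr)
above = gd_on_quadratic(sharpness=threshold + 1.0, lr=lr)
print(abs(below), abs(above))
```

In both runs the iterate flips sign at every step; the only difference is whether the oscillation decays or grows, which is why sharpness near 2 / lr is the interesting regime.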

Compositional Generalization in Language Model Reasoning

A hierarchical latent selection model shows that supervised fine-tuning and reinforcement learning work together to enable compositional generalization in language models. SFT supplies the raw modules, while RL identifies and recombines atomic modules from compound traces to solve new problems. Training on compound traces generalizes better than training on isolated modules, and the effective protocol uses SFT to ensure module coverage and RL to drive exploration of novel compositions.

arxiv arXiv cs.LG · 8d ago

S4oP: Operator-level Pruning for Efficient SSM Deployment

S4oP introduces an incremental, operator-level pruning method for S4 and S4D models, reducing inference cost by up to 70% while maintaining predictive performance. The approach combines structured masking with fine-tuning and jointly tracks accuracy and latency, enabling efficient deployment of SSMs on resource-constrained devices.

arxiv arXiv cs.LG · 8d ago

Deep Reinforcement Learning for Minimum Zero-Forcing Sets

This paper proposes SD-ZFS, a deep reinforcement learning framework adapted from S2V-DQN, to solve the NP-hard minimum zero-forcing set problem on undirected graphs. The framework demonstrates strong performance compared to optimal solutions and greedy heuristics, showing effective generalization, scalability, and transfer across diverse graph structures.
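
For readers unfamiliar with the problem, the sketch below checks whether a vertex set is a zero-forcing set under the standard color-change rule: a colored vertex with exactly one uncolored neighbor forces that neighbor to become colored. This is a plain verifier for the underlying combinatorial problem, not the SD-ZFS framework; the function names and the adjacency-dict representation are assumptions made here.

```python
def forced_closure(adjacency, initial):
    """Apply the color-change rule until no more vertices can be forced."""
    colored = set(initial)
    changed = True
    while changed:
        changed = False
        for v in list(colored):
            uncolored = [u for u in adjacency[v] if u not in colored]
            if len(uncolored) == 1:  # v forces its only uncolored neighbor
                colored.add(uncolored[0])
                changed = True
    return colored

def is_zero_forcing_set(adjacency, initial):
    """True if coloring `initial` eventually colors every vertex of the graph."""
    return forced_closure(adjacency, initial) == set(adjacency)

# Path 0-1-2-3 and a star with center 0 and leaves 1, 2, 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(is_zero_forcing_set(path, {0}), is_zero_forcing_set(star, {1}))
```

On the path, one endpoint is enough because each newly colored vertex has a single uncolored neighbor. On the star, a single leaf colors the center but then stalls, since the center still has two uncolored neighbors.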