Training methods — korshunov.ai

Frustrated Synchronization Network Outperforms Transformers

The Frustrated Synchronization Network (FSN) achieves lower validation loss than a RoPE-SwiGLU transformer at every epoch on character-level text and code tasks. At one million parameters, FSN converges to a validation loss of 1.5953 ± 0.0014, outperforming the transformer's converged loss of 1.611. This advantage persists up to four million parameters, with ongoing evaluations beyond that scale.

arxiv arXiv cs.CL · 7d ago

REVES: Augmented Training for Test-Time Scaling

REVES introduces a two-stage iterative framework that enhances large language model reasoning through sequential revision and verification. It achieves +6.5 points over RL baselines and +4.0 points over standard multi-turn training on LiveCodeBench, using a 4B base model with fewer rollouts than larger systems. The method improves error correction and generalizes to out-of-distribution puzzles like n_queens and mini_sudoku.

arxiv arXiv cs.CL · 7d ago

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO introduces a directed acyclic graph framework to represent reasoning rollouts, merging semantically equivalent paths to reduce redundant exploration. It assigns efficiency and correctness advantages to edges, improving inference efficiency and process supervision while reducing advantage-estimation variance. Experiments show GraphPO outperforms chain- and tree-based methods on three LLMs across reasoning and agentic search tasks under identical token or response budgets.
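
A minimal sketch of the path-merging idea, assuming two reasoning steps count as equivalent when a canonical form of their text matches. The `canonical` helper, the `(depth, text)` node key, and the edge-count bookkeeping are illustrative assumptions, not GraphPO's actual equivalence test or advantage computation.

```python
from collections import defaultdict


def canonical(step: str) -> str:
    """Hypothetical equivalence key: case- and whitespace-insensitive text."""
    return " ".join(step.lower().split())


def merge_rollouts(rollouts):
    """Merge step sequences into one graph keyed by (depth, canonical step).

    Returns a dict mapping each edge (parent, child) to the number of
    rollouts that traverse it. Keying nodes by depth keeps the graph acyclic.
    """
    edge_counts = defaultdict(int)
    for steps in rollouts:
        parent = (0, "<root>")
        for depth, step in enumerate(steps, start=1):
            child = (depth, canonical(step))
            edge_counts[(parent, child)] += 1
            parent = child
    return dict(edge_counts)


rollouts = [
    ["Let x = 2", "Then x + 3 = 5"],
    ["let  x = 2", "Then x * 3 = 6"],
    ["Let y = 4", "Then y / 2 = 2"],
]
edges = merge_rollouts(rollouts)
shared = ((0, "<root>"), (1, "let x = 2"))
print(len(edges), edges[shared])
```

Three rollouts of two steps each would give six edges if kept as separate chains. Here the first two rollouts share their opening step, so the merged graph stores that edge once with a traversal count of two.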

arxiv arXiv cs.AI · 7d ago

R2D-RL: RoboCup 2D Soccer Environment for MARL

R2D-RL bridges RCSS2D and HELIOS-based clients with a Python MARL interface using shared-memory and cycle-level synchronization. It enables full-field and scenario-based training with configurable opponents, action masks, EPV-based reward shaping, and parallel execution, including front-goal scenarios and an 11-vs-11 benchmark with baseline results.

arxiv arXiv cs.AI · 7d ago

Self-Conditioned Credit Assignment for RL with Verifiable Rewards

SC-GRPO uses per-token KL divergence from self-conditioned trajectories to weight gradients in reinforcement learning. It outperforms GRPO by 8.1% and DAPO by 5.9% across math, code, and agentic tasks, with superior out-of-distribution performance and better results than OPD.
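
One way to picture per-token weighting is sketched below. The group-standardized advantage follows the usual GRPO recipe; the direction of the KL term, the mean-1 normalization, and the function names are assumptions made for illustration and are not taken from the SC-GRPO paper.

```python
import math


def group_advantages(rewards):
    """GRPO-style advantage: reward standardized within a rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def token_weights(policy_dists, conditioned_dists):
    """Per-token weights proportional to KL, normalized to mean 1 (assumed)."""
    kls = [kl(c, p) for c, p in zip(conditioned_dists, policy_dists)]
    total = sum(kls)
    if total == 0:
        return [1.0] * len(kls)
    return [k * len(kls) / total for k in kls]


# Toy example: two tokens, two-word vocabulary.
policy = [[0.5, 0.5], [0.9, 0.1]]
conditioned = [[0.5, 0.5], [0.5, 0.5]]
weights = token_weights(policy, conditioned)

advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
# Token-level signal for each rollout: sequence advantage times token weight.
token_signal = [[a * w for w in weights] for a in advantages]
print(weights, advantages)
```

In this toy case the two distributions agree on the first token and disagree on the second, so all of the gradient weight lands on the second token.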

arxiv arXiv cs.AI · 7d ago

Rescaling MLM-Head for Neural Sparse Retrieval

A study finds that large MLM-head norms in pretrained encoders degrade sparse retrieval performance in SPLADE. A simple initialization-time rescaling of the MLM-head stabilizes training and improves performance, matching or exceeding BERT-SPLADE on multiple benchmarks.
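
The summary does not state the exact rescaling rule, so the following is only a sketch of one plausible form: scale the head's weight matrix once, before training, so that its Frobenius norm hits a chosen target. The function names and the choice of the Frobenius norm are assumptions.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(w * w for row in matrix for w in row))


def rescale_head(matrix, target_norm):
    """Return a copy of `matrix` scaled so its Frobenius norm is `target_norm`."""
    norm = frobenius_norm(matrix)
    if norm == 0:
        return [row[:] for row in matrix]
    scale = target_norm / norm
    return [[w * scale for w in row] for row in matrix]


head = [[3.0, 4.0], [0.0, 12.0]]  # toy weights with Frobenius norm 13
rescaled = rescale_head(head, target_norm=1.0)
print(frobenius_norm(head), frobenius_norm(rescaled))
```

Because the scaling is uniform, the relative ordering of the logits is unchanged; only their magnitude at the start of training differs.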

arxiv arXiv cs.AI · 7d ago

Reinforcement Learning Foundation Models Should Already Be A Thing

Reinforcement learning still lacks foundation models, even though training on synthetic MDPs is feasible. In a proof of concept, a single model trained on synthetic MDPs solves tabular benchmarks without tuning, outperforming existing methods in online settings and matching them offline.
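
To make "synthetic MDPs" concrete, here is a small sketch that samples a random tabular MDP and solves it with value iteration. The sampling scheme (uniform random transition rows and rewards) is an assumption, and the sketch covers only data generation and a ground-truth solver, not the model from the paper.

```python
import random


def random_mdp(n_states, n_actions, seed=0):
    """Sample a tabular MDP: transition rows sum to 1, rewards lie in [0, 1)."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        P_s, R_s = [], []
        for _ in range(n_actions):
            raw = [rng.random() for _ in range(n_states)]
            total = sum(raw)
            P_s.append([x / total for x in raw])
            R_s.append(rng.random())
        P.append(P_s)
        R.append(R_s)
    return P, R


def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Return optimal state values and a greedy policy."""
    V = [0.0] * len(P)
    while True:
        Q = [[R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
              for a in range(len(P[s]))] for s in range(len(P))]
        new_V = [max(q) for q in Q]
        if max(abs(a - b) for a, b in zip(new_V, V)) < tol:
            return new_V, [q.index(max(q)) for q in Q]
        V = new_V


P, R = random_mdp(n_states=5, n_actions=3, seed=0)
V, policy = value_iteration(P, R)
print(policy)
```

Changing `seed` yields a fresh MDP each time, which is the property that makes large synthetic training sets cheap to produce.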

arxiv arXiv cs.AI · 7d ago

Maturing Markov Decision Processes Introduce New Decision Framework

Maturing Markov Decision Processes (MMDPs) model the asymmetric evolution of information and action availability in sequential decisions. They introduce an expiring-action priority principle and a structure-aware reinforcement learning framework that improves learning efficiency, especially in complex and scalable decision problems.

arxiv arXiv cs.AI · 7d ago

Space Is Intelligence: Neural Semigroup Superposition for Riemannian Metric Generation

Intelligence is embedded in the space itself, where scenes induce a Riemannian metric on configuration manifolds. A single Encoder-Router network uses semigroup-superposition to generate this metric, enabling zero-shot generalization across unseen obstacle configurations with large cost differences between collision-free and obstacle-penetrating paths.

arxiv arXiv cs.AI · 7d ago

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS decouples experience retention from parametric updates by modeling orchestration as an evolvable Meta-Skill. A closed loop of multi-trajectory rollouts and selective reflection distills reusable strategy principles, yielding strong performance gains and robust transfer across tasks and LLMs.

arxiv arXiv cs.AI · 7d ago

Spotlight: Using Spot GPUs to Accelerate DiT RL Post-Training

Spotlight enables DiT RL post-training by leveraging idle spot GPUs, reducing costs by 1.4-6.4× while achieving superior image quality. It uses stale model weights in exploration and reconfigures sequence parallelism in real time, allowing efficient GPU utilization without breaking training pipelines.

arxiv arXiv cs.AI · 7d ago

FoMoE Breaks Full-Replica Barrier with Partitioned Expert Layers

FoMoE introduces a system that partitions expert layers across workers to avoid full model replicas, reducing communication costs by up to 1.42x over efficient baselines and 45.44x over DDP. It achieves up to 1.4x throughput speedups via a skip-token mechanism and demonstrates stable routing, with projected benefits extending to 100B-scale models through system modeling.

arxiv arXiv cs.AI · 7d ago

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

RODS addresses sample depletion in multi-turn tool-use RL by using reward variance to detect capability boundaries. It synthesizes new data in real time, matching structural complexity of boundary samples, and maintains a dynamic replay buffer that co-evolves with the policy. RODS achieves performance comparable to a 17K-sample offline pipeline with 20x fewer trajectories.
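
The reward-variance signal can be illustrated with a short sketch: tasks whose rollouts all succeed or all fail have zero variance and carry little learning signal, while mixed outcomes suggest the task sits near the policy's capability boundary. The threshold value and the function name are assumptions, not details from RODS.

```python
from statistics import pvariance


def boundary_tasks(rewards_by_task, min_variance=0.05):
    """Return task ids whose rollout rewards vary, highest variance first."""
    scored = {t: pvariance(r) for t, r in rewards_by_task.items() if len(r) > 1}
    keep = [t for t, v in scored.items() if v >= min_variance]
    return sorted(keep, key=lambda t: scored[t], reverse=True)


rewards = {
    "always_solved": [1, 1, 1, 1],
    "never_solved": [0, 0, 0, 0],
    "boundary": [1, 0, 1, 0],
    "mostly_solved": [1, 1, 1, 0],
}
selected = boundary_tasks(rewards)
print(selected)
```

The selected tasks would then serve as templates for synthesizing new samples of similar structural complexity.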

arxiv arXiv cs.AI · 7d ago

Pareto Q-Learning with Reward Machines

PQLRM is a multi-objective reinforcement learning algorithm that combines Pareto Q-Learning with Reward Machines to handle non-Markovian rewards. It converges faster than naive PQL on cross-product MDPs and generates Pareto-optimal policies beyond the capability of QRM.
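
Two building blocks behind this combination can be sketched in a few lines: a reward machine that turns a history-dependent reward into a function of its own state, and a Pareto filter over vector-valued returns. The class layout, the labels, and the two-objective reward vectors are illustrative assumptions, not PQLRM's implementation.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(vectors):
    """Keep only the non-dominated vectors (duplicates collapsed)."""
    unique = list(dict.fromkeys(map(tuple, vectors)))
    return [v for v in unique if not any(dominates(u, v) for u in unique)]


class RewardMachine:
    """Finite-state machine that emits a reward vector per labelled transition."""

    def __init__(self, transitions, initial="u0"):
        # transitions: {(machine_state, label): (next_state, reward_vector)}
        self.transitions = transitions
        self.state = initial

    def step(self, label):
        # Unlisted (state, label) pairs keep the state and give zero reward.
        default = (self.state, (0.0, 0.0))
        self.state, reward = self.transitions.get((self.state, label), default)
        return reward


# "Fetch coffee, then reach the office": the reward depends on history.
rm = RewardMachine({
    ("u0", "coffee"): ("u1", (0.0, -1.0)),
    ("u1", "office"): ("u2", (1.0, 0.0)),
})
trace = [rm.step(label) for label in ["office", "coffee", "office"]]

front = pareto_front([(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.4, 0.4), (1.0, 0.0)])
print(trace, front)
```

Pairing each environment state with the machine state gives the cross-product MDP on which a Pareto Q-learner can operate with Markovian rewards.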

media r/LocalLLaMA · 8d ago

LoopCoder-V2: Two-Loop PLT Model Achieves Best Gain-Cost Trade-Off

LoopCoder-V2 is a 7B instruction-tuned code model based on Parallel Loop Transformer (PLT), trained on 18T tokens of mixed text and code data. The two-loop variant achieves the best gain-cost balance, improving SWE-bench Verified from 43.0 to 64.4, while three or more loops result in regression due to increasing positional mismatch and unstable updates.

arxiv arXiv cs.LG · 8d ago

Recursive Masked Diffusion Models Introduce New Scaling Axis

Recursive Masked Diffusion Models (R-MDMs) introduce recursive depth as a third scaling axis by reapplying a denoising transformer within each diffusion step. This recursion enables iterative output refinement without increasing parameter count, achieving performance comparable to non-recursive models with up to L times more parameters, where L is the number of iterations. R-MDMs also reduce inference compute by partially replacing denoising steps with recursive refinement.
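
The structural idea, reusing one set of weights several times inside a diffusion step, can be sketched with a toy stand-in for the denoiser. Everything here is an assumption made for illustration: a real masked diffusion step also samples and unmasks tokens, and `toy_denoiser` is not a transformer.

```python
def recursive_step(denoiser, x, depth):
    """Apply the same denoiser `depth` times inside one diffusion step."""
    for _ in range(depth):
        x = denoiser(x)
    return x


def sample(denoiser, x, num_steps, depth):
    """Run `num_steps` diffusion steps, each with `depth` recursive passes."""
    calls = 0
    for _ in range(num_steps):
        x = recursive_step(denoiser, x, depth)
        calls += depth
    return x, calls


def toy_denoiser(x):
    """Toy refinement: move each value halfway toward a target of 1.0."""
    return [v + 0.5 * (1.0 - v) for v in x]


few_steps, calls_a = sample(toy_denoiser, [0.0], num_steps=2, depth=3)
many_steps, calls_b = sample(toy_denoiser, [0.0], num_steps=6, depth=1)
print(few_steps, calls_a, many_steps, calls_b)
```

In this toy setting two steps at depth three reach the same point as six steps at depth one, with no extra parameters; whether a real model saves compute this way depends on the per-step overhead that recursion avoids.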

arxiv arXiv cs.LG · 8d ago

Catastrophic Forgetting is Low-Rank: A Function-Space Theory

A function-space theory reveals that catastrophic forgetting in continual adaptation concentrates in a small number of old-task NTK eigenmodes. In frozen-backbone linear-head PEFT-CL, the forgetting vector is exactly predictable up to numerical precision, with a Kronecker scaling rule for the vulnerable rank.

arxiv arXiv cs.LG · 8d ago

INI-VPINN: Physics-Informed Neural Network with Implicit Boundary Handling

INI-VPINN is a variational physics-informed neural network that implicitly enforces Neumann and interface conditions using compact support weighting functions and integration by parts. It achieves higher accuracy and faster convergence than existing PINN methods in solving multi-material problems with geometric singularities and mixed boundary conditions, and is publicly available on GitHub.

arxiv arXiv cs.LG · 8d ago

Volterra Generative Models Introduce Fractional Noise for Score-Based Generation

Volterra generative models propose a continuous-time score-based framework using fractional kernels to inject path-dependent noise, avoiding memoryless noising in traditional diffusion models. The approach introduces finite-dimensional Markovian lifts and proves squared error bounds, demonstrating improved generation on MNIST and potential for natural images, with a bridge sampler enhancing stability for larger models.

arxiv arXiv cs.LG · 8d ago

Tensor-based Second-order Causal Discovery Algorithm

TSCD uses covariance matrices from observational and interventional data to identify causal structures in linear structural equation models on DAGs. It requires only uncorrelated noise and achieves identifiable causal orders and parameters with logarithmic intervention counts, scaling to hundreds of variables while remaining robust to noise and competitive with existing methods.