Training methods — korshunov.ai

Topic · Training methods

TAPO: Self-Distillation with Micro-Reflective Trajectories

TAPO advances self-distillation by constructing explicit micro-reflective trajectories that retain the erroneous reasoning and insert a natural-language diagnosis of it. Built from the model's own correct and incorrect rollouts, these trajectories provide fine-grained corrections anchored in the model's own reasoning, improving both first-pass reasoning and error correction relative to GRPO.

arXiv cs.LG · 7d ago
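
As a rough illustration of the TAPO summary above, the sketch below splices one incorrect and one correct rollout into a single trajectory that keeps the wrong step, adds a diagnosis, and then continues with the correct steps. The splicing rule (cut at the first diverging step), the function names, and the "Wait:" prefix are assumptions made for this example, not details taken from the paper.

```python
def first_divergence(correct, incorrect):
    """Index of the first step where the two rollouts differ."""
    for i, (c, w) in enumerate(zip(correct, incorrect)):
        if c != w:
            return i
    return min(len(correct), len(incorrect))


def micro_reflective(correct, incorrect, diagnosis):
    """Shared prefix + retained wrong step + diagnosis + correct continuation.

    Hypothetical construction: the diagnosis is passed in as plain text.
    """
    i = first_divergence(correct, incorrect)
    wrong_step = incorrect[i:i + 1]  # empty if the incorrect rollout simply stops early
    return incorrect[:i] + wrong_step + [f"Wait: {diagnosis}"] + correct[i:]


correct = ["2x + 6 = 10", "2x = 4", "x = 2"]
incorrect = ["2x + 6 = 10", "2x = 16", "x = 8"]
trajectory = micro_reflective(
    correct, incorrect, "6 should be subtracted from 10, not added."
)
```

Here `trajectory` keeps the erroneous step `2x = 16`, follows it with the diagnosis, and then resumes with `2x = 4` and `x = 2`.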

REVES: Augmented Training for Test-Time Scaling

REVES introduces a two-stage iterative framework that enhances LLM reasoning through sequential revision and verification. It achieves +6.5 points over RL baselines and +4.0 points over standard multi-turn training on LiveCodeBench, using a 4B base model with fewer rollouts than large evolutionary systems. The method improves error correction and generalizes to out-of-distribution puzzles like n_queens and mini_sudoku.

arXiv cs.CL · 7d ago

Frustrated Synchronization Network Outperforms Transformers

The Frustrated Synchronization Network (FSN) achieves lower validation loss than a RoPE-SwiGLU transformer at every epoch on character-level text and code tasks. At one million parameters, FSN converges to a validation loss of 1.5953 ± 0.0014, outperforming the transformer's converged loss of 1.611. This advantage persists up to four million parameters, with ongoing evaluations beyond that scale.

arXiv cs.CL · 7d ago

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO introduces a directed acyclic graph framework to represent reasoning rollouts, merging semantically equivalent paths to reduce redundant exploration. It assigns efficiency and correctness advantages to edges, improving inference efficiency and process supervision while reducing advantage-estimation variance. Experiments show GraphPO outperforms chain- and tree-based methods on three LLMs across reasoning and agentic search tasks under identical token or response budgets.

arXiv cs.AI · 7d ago
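
To make the GraphPO idea of merging equivalent paths concrete, here is a minimal sketch under stated assumptions. Semantic equivalence is replaced by a crude stand-in (case- and whitespace-insensitive string match), nodes are keyed by depth so every edge points one level deeper and the graph stays acyclic, and the per-edge "advantage" is simply the mean reward of rollouts through that edge minus the overall mean. None of these choices come from the paper; all names are hypothetical.

```python
from collections import defaultdict

ROOT = (0, "<root>")


def normalize(step):
    # Hypothetical equivalence test: ignore case and extra whitespace.
    return " ".join(step.lower().split())


def build_graph(rollouts):
    """rollouts: list of (steps, reward). Returns {edge: [visit_count, reward_sum]}."""
    edges = defaultdict(lambda: [0, 0.0])
    for steps, reward in rollouts:
        prev = ROOT
        for depth, step in enumerate(steps, start=1):
            node = (depth, normalize(step))  # depth in the key keeps the graph acyclic
            stats = edges[(prev, node)]
            stats[0] += 1
            stats[1] += reward
            prev = node
    return dict(edges)


def edge_advantages(rollouts):
    """Mean reward of rollouts through each edge minus the overall mean reward."""
    baseline = sum(reward for _, reward in rollouts) / len(rollouts)
    graph = build_graph(rollouts)
    return {edge: total / count - baseline for edge, (count, total) in graph.items()}


rollouts = [
    (["Let x = 3", "x + 2 = 5"], 1.0),
    (["let  x = 3", "x * 2 = 5"], 0.0),
    (["Let x = 4", "x + 2 = 5"], 0.0),
]
graph = build_graph(rollouts)
adv = edge_advantages(rollouts)
```

The first two rollouts share their opening step after normalization, so the merged graph has five edges where three separate chains would have six, and the shared edge pools reward statistics from both rollouts.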

Self-Conditioned Credit Assignment for RL with Verifiable Rewards

SC-GRPO uses per-token KL divergence from self-conditioned trajectories to weight gradients in reinforcement learning. It outperforms GRPO by 8.1% and DAPO by 5.9% across math, code, and agentic tasks, with superior out-of-distribution performance and better results than OPD.

arXiv cs.AI · 7d ago
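
The SC-GRPO summary above describes weighting gradients by a per-token KL divergence. The sketch below shows one way such a weighting could look: tokens where a "self-conditioned" distribution disagrees most with the plain policy distribution receive the largest share of the update. The KL direction, the normalization (weights average to 1), and all function names are assumptions for illustration, not the paper's definitions.

```python
import math


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def token_weights(policy_dists, conditioned_dists):
    """One weight per token, proportional to its KL; weights average to 1.

    Assumed direction: KL(conditioned || policy).
    """
    kls = [kl(c, p) for p, c in zip(policy_dists, conditioned_dists)]
    total = sum(kls)
    if total == 0:
        return [1.0] * len(kls)  # no disagreement anywhere: fall back to uniform
    return [k * len(kls) / total for k in kls]


def weighted_loss(token_logprobs, advantage, weights):
    """Policy-gradient surrogate with per-token weights (to be minimized)."""
    terms = [w * advantage * lp for w, lp in zip(weights, token_logprobs)]
    return -sum(terms) / len(terms)


policy = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5]]
conditioned = [[0.5, 0.5], [0.1, 0.9], [0.5, 0.5]]
weights = token_weights(policy, conditioned)
loss = weighted_loss([-0.7, -2.3, -0.7], advantage=1.0, weights=weights)
```

In this toy case only the middle token differs between the two distributions, so it receives all of the weight and the other two tokens contribute nothing to the loss.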

Skill-MAS: Evolving Meta-Skill for Automatic Multi-Agent Systems

Skill-MAS decouples experience retention from parametric updates by modeling orchestration as an evolvable Meta-Skill. A closed-loop process of multi-trajectory rollouts and selective reflection distills reusable strategy principles, yielding strong performance gains that transfer robustly across tasks and LLMs.

arXiv cs.AI · 7d ago

Spotlight: Using Spot GPUs to Accelerate DiT RL Post-Training

Spotlight enables DiT RL post-training by leveraging idle spot GPUs, reducing costs by 1.4-6.4× while achieving superior image quality. It uses stale model weights in exploration and reconfigures sequence parallelism in real time, allowing efficient GPU utilization without breaking training pipelines.

arXiv cs.AI · 7d ago

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

RODS addresses sample depletion in multi-turn tool-use RL by using reward variance to detect capability boundaries. It synthesizes new data in real time, matching the structural complexity of boundary samples, and maintains a dynamic replay buffer that co-evolves with the policy. RODS achieves performance comparable to a 17K-sample offline pipeline with 20× fewer trajectories.

arXiv cs.LG · 8d ago
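
The reward-variance signal in the RODS summary can be illustrated with a small sketch. With binary rewards, a task the policy always solves or never solves has zero variance across rollouts, while a task it solves about half the time has the maximum variance of 0.25. The threshold value and the function name below are hypothetical, chosen only for this example.

```python
from statistics import pvariance


def boundary_tasks(rewards_by_task, min_variance=0.1):
    """Return task ids whose rollout rewards vary enough to sit near the capability boundary.

    min_variance is a hypothetical threshold for this sketch.
    """
    return [
        task
        for task, rewards in rewards_by_task.items()
        if pvariance(rewards) >= min_variance
    ]


rewards = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],
    "never_solved": [0.0, 0.0, 0.0, 0.0],
    "boundary": [1.0, 0.0, 1.0, 0.0],
}
selected = boundary_tasks(rewards)
```

Only the `boundary` task is selected here. In a pipeline like the one summarized above, such tasks would presumably serve as templates for newly synthesized data.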

Compositional Generalization in Language Model Reasoning

A hierarchical latent selection model shows that supervised fine-tuning (SFT) and reinforcement learning (RL) work together to enable compositional generalization in language models. SFT supplies the raw modules, while RL identifies and recombines atomic modules from compound traces to solve new problems. Training on compound traces generalizes better than training on isolated modules, and an effective protocol uses SFT to ensure module coverage and RL to drive exploration of novel compositions.

arXiv cs.CL · 8d ago

SkillWeaver: Compositional Skill Routing for LLM Agents

SkillWeaver introduces a decompose-retrieve-compose framework for LLM agents, formalizing the Compositional Skill Routing problem. Its Iterative Skill-Aware Decomposition (SAD) raises decomposition accuracy from 51.0% to 67.7% (p < 10^-6), and the framework reduces context window usage by over 99%.

arXiv cs.CL · 8d ago

d-OPSD: On-policy Self-distillation for Diffusion LLMs

d-OPSD is the first on-policy self-distillation framework designed for diffusion LLMs. It uses self-generated answers as suffix conditioning and step-level supervision, enabling efficient post-training with only about 10% of RLVR's optimization steps while outperforming RLVR and SFT baselines on four reasoning benchmarks.

arXiv cs.CL · 8d ago

ZPPO: Teacher in Prompts, Not Gradients

Zone of Proximal Policy Optimization (ZPPO) integrates teacher knowledge directly into prompts rather than policy gradients. It uses Binary and Negative Candidate-included Questions to surface student failure modes and amplifies learning through a prompt replay buffer, achieving superior performance on hard questions across student scales, especially at smaller model sizes.

arXiv cs.LG · 8d ago

Reversal Q-Learning: A New Off-Policy RL Algorithm

Reversal Q-Learning (RQL) is a new off-policy reinforcement learning algorithm that trains a flow policy using prior data. By modeling flow refinement steps as actions in an expanded Markov decision process and applying virtual on-policy trajectories via reversal, RQL enables effective offline learning without backpropagation through time. Experiments on 50 robotic tasks show RQL achieves the best average performance among state-of-the-art flow-based offline RL methods.

arXiv cs.LG · 8d ago

EnvRL: Leveraging Environment Dynamics in Agentic RL

EnvRL introduces a framework that enhances agentic reinforcement learning by incorporating environment dynamics through state prediction and inverse dynamics objectives. When trained with GRPO, EnvRL improves success rates of Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.

arXiv cs.LG · 8d ago
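
For readers comparing the two EnvRL results above, this short calculation separates the absolute gain in percentage points from the relative gain over the baseline. It uses only the four success rates quoted in the summary; the helper name is made up for this example.

```python
def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    return after - before, 100.0 * (after - before) / before


results = {"ALFWorld": (72.8, 77.4), "WebShop": (56.8, 67.0)}
for name, (before, after) in results.items():
    points, relative = gains(before, after)
    print(f"{name}: +{points:.1f} points ({relative:.1f}% relative)")
```

This works out to +4.6 points (about 6.3% relative) on ALFWorld and +10.2 points (about 18.0% relative) on WebShop, so the WebShop improvement is the larger one on both measures.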

Lightweight Experiential Latent Memories for Continual Self-Improvement

This method lets large language models learn from their own reasoning traces without external supervision. It distills inference-time computation into lightweight, modular latent memories, reaching performance competitive with full training and outperforming zero-shot and raw ICL baselines on mathematical reasoning tasks with minimal computational overhead.

arXiv cs.AI · 8d ago

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

STAR introduces a spatio-temporal reward allocation method for text-to-image generation, using attention maps to dynamically assign advantages across denoising steps. It improves semantic alignment, text rendering, and preference optimization in Stable Diffusion 3.5 Medium, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore respectively.

arXiv cs.AI · 8d ago

Meta-Knowledge Reutilization in Reinforcement Learning

A new framework learns task-level knowledge on a simplified agent and transfers it to heterogeneous agents. It uses Bayesian non-parametric priors and a high-level policy to generate task guidance, with a semantic-magnitude interface and temporal adaptor to align meta-knowledge with embodiment-specific controllers. Experiments show 94.75% to 99.79% reduction in final-step tracking error and comparable performance using 23.8% of the interaction data of state-of-the-art methods.

arXiv cs.CL · 8d ago

LLM-Designed Training Environment for RL with Multi-Agent Reasoning

The LLM-as-Environment-Engineer framework uses LLMs to automatically redesign training environments in reinforcement learning by analyzing failure trajectories and contextual data. On the MAPF-FrozenLake testbed, it outperforms larger proprietary LLMs and fixed-environment baselines, with Qwen3-4B achieving the strongest aggregate performance. Analysis shows that failure evidence and preserved working configurations are key, and the current RL checkpoint performs better than the base model as an environment engineer.
