Training methods — korshunov.ai

P4IR Framework Improves LLM-Based Code Compliance Accuracy

P4IR is a two-stage framework that combines supervised fine-tuning with Group Relative Policy Optimization (GRPO) to improve LLM-based automated code compliance systems. It reduces tree edit distance and token-level Levenshtein distance by up to 23.8% and 38.6%, respectively, outperforming leading LLMs such as Claude Opus, GPT-5.2, and GLM-4.7 in both zero-shot and few-shot prompting settings, and it reduces false positives by a small but statistically significant margin.

arXiv cs.AI · 20h ago
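
For readers unfamiliar with the second metric, here is a minimal sketch of a token-level Levenshtein distance. It assumes whitespace tokenization and unit costs for insertion, deletion, and substitution; the function name and the example rule strings are illustrative and are not taken from the paper.

```python
def token_levenshtein(reference, candidate):
    """Edit distance between two token sequences (unit cost per edit)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j tokens of the candidate.
    prev = list(range(len(candidate) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, cand_tok in enumerate(candidate, start=1):
            cost = 0 if ref_tok == cand_tok else 1
            curr.append(min(prev[j] + 1,          # delete ref_tok
                            curr[j - 1] + 1,      # insert cand_tok
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Hypothetical compliance rule, tokenized on whitespace (an assumption).
reference = "wall.height >= 2.4".split()
candidate = "wall.height > 2.4".split()
distance = token_levenshtein(reference, candidate)
```

Here the two sequences differ in a single token, so one substitution is enough to turn one into the other.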

SciVerseGym: Reinforcement Learning Environment for Crystal Discovery

SciVerseGym introduces a Gymnasium-compatible environment that frames crystal discovery as a Markov decision process. It enables agents to perform chemically meaningful edits on atomic structures and receive feedback from configurable evaluators, supporting diverse actions and observation types with machine-learned potentials or ASE-compatible calculators.

arXiv cs.AI · 23h ago

Imagine to Ensure Safety in Hierarchical Reinforcement Learning

The method combines a learnable world model with high- and low-level policies to enable safe exploration in long-horizon tasks. The high-level policy guides exploration toward safe subgoals, while the low-level policy uses imagined rollouts to prevent unsafe behaviors, outperforming existing Safe RL methods in success rate and constraint satisfaction across diverse tasks.

arXiv cs.AI · 23h ago

Fed-CausalDiff: Decoupled Synchronization for Federated Do-Simulation

Fed-CausalDiff introduces a federated causal diffusion framework that enables do-simulation in decentralized settings. It decomposes latent state evolution into global and local components, which allows decoupled synchronization that reduces communication cost while maintaining accurate policy evaluation and average treatment effect (ATE) estimation.

arXiv cs.AI · 1d ago

Importance-Weighted On-Policy Distillation Addresses Position Bias

On-Policy Distillation (OPD) suffers from position bias: later tokens provide poor supervision. Importance-Weighted OPD (IW-OPD) assigns dynamic weights based on distribution discrepancy, prioritizing early tokens and suppressing late ones. IW-OPD converges faster and achieves gains of up to 6.9 points on AIME-2025 compared to standard OPD.

arXiv cs.LG · 1d ago

Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization

ROVER enables reward-free pretraining by maximizing occupancy coverage in state space, using a learned world model to estimate occupancy without density or entropy estimation. It introduces a virtual sink state to balance exploration of known and unknown regions, achieving more uniform coverage and better downstream performance in tabular and pixel-based navigation tasks.

arXiv cs.LG · 1d ago

Central limit theorem for averaged Adam optimizer

The article establishes a central limit theorem for the averaged Adam optimizer, showing convergence at order n^{-1/2}. This rate matches classical stochastic approximation algorithms, with the covariance expressed in terms of the algorithm's properties at the attractor state.

arXiv cs.LG · 1d ago
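
For orientation, a central limit theorem for an averaged iterate is typically stated in the generic form below. The symbols are assumptions for illustration, not the paper's notation: \theta_k are the optimizer iterates, \bar{\theta}_n is their running average, \theta^{*} is the attractor, and \Sigma is the limiting covariance.

```latex
\sqrt{n}\,\bigl(\bar{\theta}_n - \theta^{*}\bigr) \xrightarrow{\;d\;} \mathcal{N}(0, \Sigma),
\qquad
\bar{\theta}_n = \frac{1}{n}\sum_{k=1}^{n} \theta_k
```

A statement of this shape means the averaging error is of order n^{-1/2}, which is the rate the summary refers to.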

BIPC Framework Accelerates Mixed-Integer Optimization with Machine Learning

The BIPC framework reduces solution time for large-scale mixed-integer programs by identifying a backdoor subset of variables that drive computational complexity. Using supervised learning, it predicts backdoor variable values and intervals, then solves a reduced problem with these predictions, achieving significant speedups with minimal quality loss. This enables rapid, high-quality solutions under parameter perturbations in real-world systems like power and supply chains.

arXiv cs.LG · 1d ago

Deep Learning with O(log N) Parallel Time Complexity

Hierarchical Block-Local Learning (HBLL) trains deep neural networks with O(log N) parallel time complexity, eliminating the need for full backpropagation. HBLL decomposes networks into hierarchically linked blocks and achieves competitive performance on vision and language tasks, with extensions to recurrent architectures.

arXiv cs.LG · 1d ago

Analytic Policy Gradients for Efficient Continuous Control

Analytic Policy Gradients (APG) enables exact gradient computation via backpropagation through simulation when environment dynamics are differentiable. APG outperforms Proximal Policy Optimization (PPO) on four continuous control tasks, showing superior sample and learning efficiency with a segmented backpropagation scheme that reduces gradient degradation on long-horizon tasks.

arXiv cs.LG · 1d ago

Muon Optimizer: Power, Limits, and a River-Valley Theory

A new trajectory-level theory shows that Muon accelerates early in optimization along the information-bearing river direction but, unlike gradient descent, converges slowly near the bottom. With momentum, Muon's orthogonalized updates remove residual scale information, leading to overshooting and oscillation. The study advocates a two-stage approach (Muon early, then a switch to gradient-descent-like optimizers) for improved LLM training performance.

arXiv cs.LG · 1d ago

GOMA Achieves First Stochastic Convergence Guarantee for Variational Inequalities

The paper introduces GOMA, a family of first-order methods for monotone variational inequalities. In the stochastic setting with unbounded variance, a simplified variant of GOMA achieves an O(1/\sqrt{k}) last-iterate convergence rate on the squared gradient norm, without variance reduction or growing batches. This is the first such guarantee for unconstrained stochastic monotone Lipschitz variational inequalities.

arXiv cs.LG · 1d ago

FAST: A Framework for Aligned Sampling and Training in Parallel Reinforcement Learning

FAST addresses sampling inefficiency in reinforcement learning for autonomous driving by introducing Dynamic Parallel Sampling Alignment, which decouples sampling loops from individual episode terminations. It achieves up to a 1.78x wall-clock speedup over single-clip baselines while maintaining statistical unbiasedness through Scaled Mask-Padding Optimization.

arXiv cs.AI · 2d ago

Analytic Policy Gradients for Sample and Learning Efficient Control

arXiv cs.AI · 2d ago

Deep learning with O(log N) parallel time complexity

arXiv cs.AI · 2d ago

Energy Consumption Model for Transformer Training

A new framework models energy consumption in Transformer training on multiple GPUs. It uses BERT architectural sweeps to link measured energy to compute, memory traffic, and hardware efficiency proxies. The model, inspired by roofline analysis, includes a speedup-based hardware-efficiency factor and predicts training energy across diverse GPU configurations.

arXiv cs.CL · 2d ago
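
As background on the roofline idea the summary mentions, the sketch below bounds step time by the slower of compute and memory traffic, then converts time to energy. It is an illustration under loud assumptions, not the paper's model: the function names, the constant average power draw, and the way `hw_efficiency` divides the ideal time are all hypothetical.

```python
def roofline_time_s(flops, bytes_moved, peak_flops_per_s, mem_bandwidth_bytes_per_s):
    """Classic roofline bound: runtime is limited by compute or by memory traffic."""
    compute_time = flops / peak_flops_per_s
    memory_time = bytes_moved / mem_bandwidth_bytes_per_s
    return max(compute_time, memory_time)


def estimate_energy_j(flops, bytes_moved, peak_flops_per_s,
                      mem_bandwidth_bytes_per_s, avg_power_w, hw_efficiency=1.0):
    """Hypothetical energy estimate: power times (ideal roofline time / efficiency).

    hw_efficiency in (0, 1] is an assumed stand-in for a hardware-efficiency
    factor; the paper's actual definition is not given in the summary.
    """
    ideal_time = roofline_time_s(flops, bytes_moved,
                                 peak_flops_per_s, mem_bandwidth_bytes_per_s)
    return avg_power_w * ideal_time / hw_efficiency


# Made-up numbers: a compute-bound step on a device running at half efficiency.
energy = estimate_energy_j(2e12, 1e9, 1e12, 1e9, avg_power_w=300.0, hw_efficiency=0.5)
```

With these made-up inputs the step is compute-bound (2 s versus 1 s for memory), so the estimate is driven by the FLOP count rather than memory traffic.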

Randomized YaRN Improves Length Generalization for Long-Context Reasoning

Randomized YaRN enhances long-context reasoning by combining YaRN positional extrapolation with randomized positional encoding and a length curriculum. It outperforms standard fine-tuning on benchmarks like BABILong and MRCR, showing significant gains at far out-of-distribution context lengths.

arXiv cs.CL · 2d ago
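
The randomized positional encoding ingredient can be illustrated with a short sketch. It follows the general idea of drawing an ordered subset of position indices from a range larger than the training length, so that the model sees large position values during training. The function name and parameters are hypothetical, and neither the YaRN scaling nor the paper's length curriculum is shown.

```python
import random


def sample_random_positions(seq_len, max_position, rng):
    """Draw seq_len distinct position ids from [0, max_position), in increasing order."""
    if seq_len > max_position:
        raise ValueError("seq_len must not exceed max_position")
    # Sampling without replacement, then sorting, preserves token order
    # while exposing the model to position ids beyond seq_len.
    return sorted(rng.sample(range(max_position), seq_len))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
positions = sample_random_positions(seq_len=8, max_position=64, rng=rng)
```

The returned ids would replace the usual 0..seq_len-1 positions for one training example; a fresh draw per example is the usual choice.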

Adaptive Data Scheduling Improves LLM Reinforcement Learning

Adaptive Data Scheduling (ADS) introduces a dual-level data scheduling framework that replaces uniform sampling with adaptive distribution over semantic clusters and policy-boundary sample selection. Experimental results show ADS improves average accuracy by 5.2% over GRPO across three LLMs and seven reasoning benchmarks, demonstrating its effectiveness as a general strategy for LLM RL post-training.

arXiv cs.CL · 2d ago
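
For context on the GRPO baseline, the sketch below computes group-relative advantages: each sampled response is scored against the mean and standard deviation of the rewards in its own group. This illustrates the baseline only, not ADS. The use of the population standard deviation and the `eps` value are assumptions, and implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When all responses in a group receive the same reward, every advantage is zero and the group contributes no learning signal, which is one reason the choice of training samples matters.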

Key Factors in RL for LLM Reasoning Revealed

A theoretical analysis shows that off-policy degree, determined by gradient steps per rollout, significantly impacts importance sampling ratios and token update dominance. The study introduces Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries by token group variance, outperforming DAPO and CISPO on 3B and 7B models across mathematical, QA, and logic reasoning tasks.
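
To make the role of clipping boundaries concrete, here is a minimal sketch of the standard PPO-style clipped surrogate for a single token, with separate lower and upper bounds. It is not ACPO: the summary does not say how ACPO derives its bounds from token group variance, so the bounds here are fixed, hypothetical parameters.

```python
def clipped_surrogate(ratio, advantage, clip_low=0.2, clip_high=0.2):
    """PPO-style clipped objective for one token.

    ratio is the importance sampling ratio pi_new(token) / pi_old(token).
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    # Taking the minimum makes the objective pessimistic: once the ratio has
    # moved past the bound in the direction the advantage favors, the token
    # stops contributing gradient.
    return min(ratio * advantage, clipped_ratio * advantage)


capped = clipped_surrogate(ratio=1.5, advantage=1.0)     # limited by the upper bound
uncapped = clipped_surrogate(ratio=1.5, advantage=-1.0)  # unclipped branch is selected
```

The more gradient steps are taken per rollout, the further the ratios drift from 1 and the more tokens hit these bounds, which is why the choice of boundaries interacts with the off-policy degree.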