Training methods — korshunov.ai

Training methods Page 1 / 12

OPERA: Aligning Open-Ended Reasoning via Objective Perplexity-based Reinforcement Learning

The OPERA framework addresses the instability of applying reinforcement learning to open-ended tasks by replacing external judge models with intrinsic rewards derived from perplexity dynamics. This approach quantifies uncertainty reduction at critical reflective states, eliminating stylistic biases and positional inconsistencies common in LLM-as-a-judge systems. During the cold-start phase, the method utilizes guiding words to synthesize diverse reasoning traces and employs perplexity-prioritized rollouts to identify logically consistent branches. This pipeline generates a large-scale dataset of 20,000 high-quality reasoning trajectories for training. Implementing OPERA on the Qwen3-8B model establishes a new state-of-the-art among open-source models. The system achieves parity with or surpasses proprietary models like Gemini2.5 and MiniMax-M2.5 in specific open-ended tasks. Empirical evaluations confirm the scalability and efficacy of this objective perplexity-based alignment strategy.

media Hugging Face Forums · 1h ago Live

Niodoo: A Local Runtime for Hidden State Steering of Frozen LLMs

Jason Van Pham has released Niodoo, a local runtime designed to steer frozen large language models through their hidden states. The project aims to correct last-step errors by injecting noise or "physics forces" during inference to break token loops. This approach allows smaller models to improve performance without fine-tuning, targeting specific failure cases like the Llama strawberry prompt benchmark. The system generates its own telemetry tags and utilizes TDA analysis to monitor internal model states for looping behavior. Van Pham developed this tool organically through months of self-directed research and red-teaming, emphasizing reproducible results from pinned hashes. The code is available on GitHub under the repository Ruffian-L/niodoo-hidden-state-steering.

media Hugging Face Forums · 1h ago Live

Prompt Format Inquiry for Training Unsloth/Phi-3.5-mini-instruct

A user seeks advice on the optimal prompt formatting strategy for training the Phi-3.5-mini-instruct model using Unsloth. The inquiry contrasts maintaining a custom text format against utilizing a standard chat template for dataset preparation. The current implementation employs a function that structures data into '### Input:' and '### Output:' sections, appending an end-of-text token. This approach processes JSON-encoded input and output fields derived from a Hugging Face Dataset object. The provided example illustrates a complex structure involving financial insights, merchant names, dates, and transaction totals. The user intends to deploy the trained model via a custom API and requests guidance on whether to retain this format or switch to a chat template.

arxiv arXiv cs.CL · 1h ago Live

Space-Efficient Language Generation in the Limit

This study initiates a resource-aware theory of language generation in the limit under space efficiency constraints. A learner observes an adversarial positive stream from a target language K and must output a hallucination-free hypothesis L while omitting at most Δ strings. The research focuses on DFAs with s states over an alphabet of size k as the hypothesis class for memory-bounded learners. In the exponential-space regime, the authors prove that a learner can exactly identify the target language K. Under stricter memory budgets, they present a streaming algorithm using poly(s,k) space that converges to a hypothesis with a generation gap of Δ= O(k^{2s-2}). This learned hypothesis captures every string in K of length at least 2s-1. The results are complemented by a near-matching lower bound derived from communication complexity, showing that achieving Δ≤ k^{(1-ε)s} requires k^{Ω(εs)} memory. These findings reveal a sharp transition between polynomial-space generation and exponential-space exact identification.

arxiv arXiv cs.CL · 1h ago Live

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

Sparse Mixture-of-Experts (MoE) architectures often struggle with low-resource languages due to cross-lingual routing divergence that limits expert sharing. To address this, researchers propose SARA, a framework that transfers specialized capabilities from high-resource anchor languages to low-resource ones. SARA aligns the internal routing distributions of MoE layers using a symmetric Jensen-Shannon divergence constraint rather than operating on output logits. This approach encourages mechanistic consistency in expert selection across different languages. The authors evaluated the method on two large language models across five low-resource languages and three benchmarks. Results show SARA outperforms standard instruction tuning, achieving gains of +0.8% on Qwen3-30B-A3B and +1.2% on Phi-3.5-MoE-instruct for Global-MMLU. These findings demonstrate that SARA effectively addresses performance bottlenecks in low-resource contexts.

media r/LocalLLaMA · 4h ago

Gefen: A Drop-in Replacement for AdamW with Claimed 8x Memory Reduction

Gefen is presented as a drop-in replacement for the AdamW optimizer, claiming an eightfold reduction in memory usage during training. The project includes a GitHub repository available at ndvbd/Gefen and a corresponding research paper hosted on arXiv under the identifier 2606.13894. This submission highlights Gefen's potential to optimize resource efficiency for machine learning workflows. The provided source material links directly to the technical documentation and codebase for further verification. No additional performance metrics or comparative benchmarks are detailed in the available text.

arxiv arXiv cs.AI · 10h ago

P4IR Framework Improves LLM-Based Code Compliance Accuracy

P4IR, a two-stage framework, uses supervised fine-tuning and Group Relative Policy Optimization to enhance large language model-based automated code compliance systems. It reduces tree edit and token-level Levenshtein distances by up to 23.8% and 38.6% respectively, outperforming leading LLMs like Claude Opus, GPT-5.2, and GLM-4.7 in zero-shot settings with few-shot prompting, and reduces false positives by a small but statistically significant margin.

arxiv arXiv cs.AI · 11h ago

SciVerseGym: Reinforcement Learning Environment for Crystal Discovery

SciVerseGym introduces a Gymnasium-compatible environment that frames crystal discovery as a Markov decision process. It enables agents to perform chemically meaningful edits on atomic structures and receive feedback from configurable evaluators, supporting diverse actions and observation types with machine-learned potentials or ASE-compatible calculators.

arxiv arXiv cs.AI · 14h ago

Imagine to Ensure Safety in Hierarchical Reinforcement Learning

The method combines a learnable world model with high- and low-level policies to enable safe exploration in long-horizon tasks. The high-level policy guides exploration toward safe subgoals, while the low-level policy uses imagined rollouts to prevent unsafe behaviors, outperforming existing Safe RL methods in success rate and constraint satisfaction across diverse tasks.

arxiv arXiv cs.AI · 14h ago

Fed-CausalDiff: Decoupled Synchronization for Federated Do-Simulation

Fed-CausalDiff introduces a federated causal diffusion framework that enables do-simulation in decentralized settings. It decomposes latent state evolution into global and local components, allowing decoupled synchronisation to reduce communication cost while maintaining accurate policy evaluation and ATE estimation.

arxiv arXiv cs.AI · 16h ago

Importance-Weighted On-Policy Distillation Addresses Position Bias

On-Policy Distillation (OPD) suffers from position bias where later tokens provide poor supervision. Importance-Weighted OPD (IW-OPD) assigns dynamic weights based on distribution discrepancy, prioritizing early tokens and suppressing late ones. IW-OPD converges faster and achieves up to 6.9 point performance gains on AIME-2025 compared to standard OPD.

arxiv arXiv cs.LG · 16h ago

Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization

ROVER enables reward-free pretraining by maximizing occupancy coverage in state space, using a learned world model to estimate occupancy without density or entropy estimation. It introduces a virtual sink state to balance exploration of known and unknown regions, achieving more uniform coverage and better downstream performance in tabular and pixel-based navigation tasks.

arxiv arXiv cs.LG · 18h ago

Central limit theorem for averaged Adam optimizer

The article establishes a central limit theorem for the averaged Adam optimizer, showing convergence at order n^{-1/2}. This rate matches classical stochastic approximation algorithms, with the covariance expressed in terms of the algorithm's properties at the attractor state.

arxiv arXiv cs.LG · 18h ago

BIPC Framework Accelerates Mixed-Integer Optimization with Machine Learning

The BIPC framework reduces solution time for large-scale mixed-integer programs by identifying a backdoor subset of variables that drive computational complexity. Using supervised learning, it predicts backdoor variable values and intervals, then solves a reduced problem with these predictions, achieving significant speedups with minimal quality loss. This enables rapid, high-quality solutions under parameter perturbations in real-world systems like power and supply chains.

arxiv arXiv cs.LG · 19h ago

Deep Learning with O(log N) Parallel Time Complexity

Hierarchical Block-Local Learning (HBLL) enables deep neural network training in O(log N) parallel time complexity, eliminating the need for full backpropagation. HBLL decomposes networks into hierarchically linked blocks and achieves competitive performance on vision and language tasks, with extensions to recurrent architectures.

arxiv arXiv cs.LG · 19h ago

Analytic Policy Gradients for Efficient Continuous Control

Analytic Policy Gradients (APG) enables exact gradient computation via backpropagation through simulation when environment dynamics are differentiable. APG outperforms Proximal Policy Optimization (PPO) on four continuous control tasks, showing superior sample and learning efficiency with a segmented backpropagation scheme that reduces gradient degradation on long-horizon tasks.

arxiv arXiv cs.LG · 20h ago

Muon Optimizer: Power, Limits, and a River-Valley Theory

A new trajectory-level theory reveals Muon accelerates early in optimization along the information-bearing river direction but converges slowly near the bottom, unlike gradient descent. With momentum, Muon's orthogonalized updates remove residual scale information, leading to overshooting and oscillation. The study advocates a two-stage approach—using Muon early and switching to gradient descent-like optimizers later—for improved LLM training performance.

arxiv arXiv cs.LG · 20h ago

GOMA Achieves First Stochastic Convergence Guarantee for Variational Inequalities

The paper introduces GOMA, a family of first-order methods for monotone variational inequalities. In the stochastic setting with unbounded variance, a simplified variant of GOMA achieves an O(1/\sqrt{k}) last-iterate convergence rate on the squared gradient norm, without variance reduction or growing batches. This is the first such guarantee for unconstrained stochastic monotone Lipschitz variational inequalities.

arxiv arXiv cs.LG · 1d ago

FAST: A Framework for Aligned Sampling and Training in Parallel Reinforcement Learning

FAST addresses sampling inefficiency in autonomous driving reinforcement learning by introducing Dynamic Parallel Sampling Alignment to decouple episode termination from sampling loops. It achieves up to 1.78 times wall-clock speedup over single-clip baselines while maintaining statistical unbiasedness through Scaled Mask-Padding Optimization.

arxiv arXiv cs.AI · 1d ago

OPERA: Aligning Open-Ended Reasoning via Objective Perplexity-based Reinforcement Learning

Niodoo: A Local Runtime for Hidden State Steering of Frozen LLMs

Prompt Format Inquiry for Training Unsloth/Phi-3.5-mini-instruct

Space-Efficient Language Generation in the Limit

SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment

Gefen: A Drop-in Replacement for AdamW with Claimed 8x Memory Reduction

P4IR Framework Improves LLM-Based Code Compliance Accuracy

SciVerseGym: Reinforcement Learning Environment for Crystal Discovery

Imagine to Ensure Safety in Hierarchical Reinforcement Learning

Fed-CausalDiff: Decoupled Synchronization for Federated Do-Simulation

Importance-Weighted On-Policy Distillation Addresses Position Bias

Reward-free Pretraining for Reinforcement Learning via Occupancy Coverage Maximization

Central limit theorem for averaged Adam optimizer

BIPC Framework Accelerates Mixed-Integer Optimization with Machine Learning

Deep Learning with O(log N) Parallel Time Complexity

Analytic Policy Gradients for Efficient Continuous Control

Muon Optimizer: Power, Limits, and a River-Valley Theory

GOMA Achieves First Stochastic Convergence Guarantee for Variational Inequalities

FAST: A Framework for Aligned Sampling and Training in Parallel Reinforcement Learning

Analytic Policy Gradients for Sample and Learning Efficient Control