Training methods — korshunov.ai

Training methods Page 1 / 15

SPIRAL: Learning to Search and Aggregate

The authors introduce Sequential-Parallel-Aggregative Reinforcement Learning (SPIRAL), a framework that trains language models to utilize sequential, parallel, and aggregative reasoning primitives simultaneously. Unlike standard post-training methods that optimize only for single-trace sequential reasoning, SPIRAL unifies these components into a single inference compute pipeline. The model first samples independent traces in parallel using chain-of-thought reasoning and then generates a final aggregation trace conditioned on those inputs. This entire process is optimized end-to-end against the reward of the final aggregated response using set reinforcement learning and standard reinforcement learning techniques. Experiments on reasoning tasks demonstrate that SPIRAL effectively scales with inference compute resources. The approach outperforms GRPO by up to 11 times in scaling efficiency and achieves 15% higher performance when all three compute primitives are scaled.

arxiv arXiv cs.AI · 3d ago

Dual-Learned Matching Enables Linear Mode Connectivity for Billion-Parameter Transformers

Researchers propose a scalable framework to enable linear mode connectivity-based merging for billion-parameter pretrained transformers. Existing methods typically optimize interpolation paths from only one model endpoint, limiting scalability for large architectures. The new approach applies parameterized weight transformations to align functionally equivalent solutions and uses a dual learning procedure where both models jointly learn transformations toward a shared path. This bidirectional optimization substantially reduces interpolation barriers and improves merging reliability across large-scale models. Empirically, the method achieves near-zero loss barriers on WikiText for medium-sized language models. In vision tasks, ViT-L maintains above 69% ImageNet top-1 accuracy throughout the interpolation path. Modern billion-parameter LLMs exhibit only small loss barriers using this technique.

arxiv arXiv cs.AI · 3d ago

RECALL: Active Lifelong Learning for Vision-Language-Action Models

The paper introduces RECALL, an active, continual learning paradigm for Vision-Language-Action models that addresses the inefficiencies of passive imitation learning. Unlike traditional methods that require robot failures to trigger data collection, this approach uses uncertainty-guided recovery demonstrations to proactively identify states needing supervision. The authors demonstrate that this targeted data collection leads to more efficient fine-tuning compared to passively collected demonstrations. However, the study reveals that fine-tuning exclusively on this active recovery data causes catastrophic forgetting of previously learned behaviors. To mitigate this issue, the work evaluates continual learning techniques such as replay-based data mixing and elastic weight consolidation. These experiments highlight the critical tradeoffs between plasticity for new tasks and retention of existing capabilities in autoregressive VLAs. Ultimately, the research establishes that while uncertainty-guided recovery improves adaptation efficiency, incorporating targeted new data into large robot policies presents significant open challenges.

media Hugging Face Forums · 3d ago

Discussion on Cost-Effective Small Language Model Fine-Tuning in 2026

A recent discussion on the Hugging Face forums explores the most efficient methods for customizing small AI models for specific tasks. The thread, titled "What is the most cost-effective way to fine-tune a small language model in 2026?", seeks advice on minimizing expenses while maintaining performance. It was initiated by a single participant aiming to optimize their workflow for specialized applications. The inquiry highlights the growing interest in leveraging smaller models to reduce computational overhead. Participants are encouraged to share strategies that balance cost and efficiency in the current landscape. This topic reflects ongoing efforts to make model adaptation more accessible and affordable.

arxiv arXiv cs.AI · 3d ago

Learning Process Rewards via Success Visitation Matching for Efficient RL

The authors address the challenge of training reinforcement learning policies with inherently sparse outcome rewards, which leads to difficult credit assignment problems. They propose a method to transform these sparse rewards into dense process rewards by training a discriminator to distinguish between successful and unsuccessful episodes. This discriminator incentivizes the policy to match the state-action visitations of successful episodes while avoiding those of unsuccessful ones. By providing dense feedback on progress toward task completion, the approach provably achieves this without altering the optimal policy. The method is specifically applied to the finetuning of robotic control policies for manipulation tasks. Experimental results demonstrate significantly faster RL finetuning performance in both simulated and real-world environments compared to maximizing sparse outcome rewards alone.

arxiv arXiv cs.AI · 3d ago

Tapered Language Models: Improving Performance via Depth-Aware Capacity Allocation

Modern language models typically allocate parameters uniformly across identical layers, despite evidence that later layers primarily refine the residual stream rather than transform it. To address this asymmetry, researchers investigated whether parameter capacity should vary by depth under a fixed budget. Controlled experiments demonstrated that allocating more capacity to earlier layers and less to later layers improves perplexity compared to uniform baselines, while the reverse allocation degrades performance. Building on these results, the authors introduce Tapered Language Models (TLMs), an architectural principle where parameter-bearing components are monotonically tapered across depth. MLPs serve as the primary site for this instantiation due to their dominance in parameter count and clear width axis. The study tested tapering via a smooth cosine schedule across three model scales and four architectures, including Transformer, Gated Attention, Hope-attention, and Titans. Results show that TLMs consistently improve perplexity and downstream benchmark performance over uniform baselines without additional compute costs. These findings establish depth-aware capacity allocation as a simple, architecture-agnostic design lever for language models.

arxiv arXiv cs.AI · 3d ago

NVIDIA Nemotron Challenge: String Matching and Backtracking for Bit Manipulation Puzzles

This paper details algorithmic innovations developed for the NVIDIA Nemotron Model Reasoning Challenge, specifically targeting bit manipulation puzzles where models must deduce hidden logical rules. To address the combinatorial explosion of bitwise operations and LLM hallucinations, the authors abandon arithmetic logic in favor of string similarity and structured search. The core contribution reframes logic-gate deduction as a base-selection task using minimal bit flips to isolate primitive transformations. A backtracking depth-first search process is formalized to test candidates, detect logical collisions, and perform robust error recovery. Additionally, the method employs bit tokenization and interactive reasoning supervised fine-tuning with dynamic masking to simulate oracle feedback. Evaluated on these puzzles, the approach achieved over 96% validation accuracy. This performance secured the highest result in the category and a seventh-place finish in the overall contest.

arxiv arXiv cs.AI · 3d ago

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

AdamW serves as the standard optimizer for training large language models, yet its theoretical foundation remains largely confined to finite-variance regimes. This gap is significant because empirical evidence suggests that stochastic gradient noise during LLM pretraining typically exhibits heavy-tailed characteristics. Recent studies have demonstrated that sign-based optimizers like Lion and Muon achieve sharp convergence rates under heavy-tailed conditions, while AdaGrad also converges in this setting. However, rigorous convergence theory for AdamW has not yet been established within these heavy-tailed assumptions. The authors pose an open problem regarding whether AdamW can converge under the same heavy-tailed assumptions or if its second-moment accumulator creates a genuine obstruction. To address this, they formulate a positive weighted-metric benchmark and provide a corridor lower-bound mechanism. This mechanism illustrates how denominator memory in AdamW can effectively hide large gradients, potentially impacting its performance.

arxiv arXiv cs.LG · 3d ago

Encoder-Decoder Manifold Alignment for Idempotent Generation

Recent learning paradigms aim to enforce idempotency in generative models by ensuring repeated application leaves samples unchanged on the target data manifold. However, many existing approaches fail to achieve exact fixed points, resulting in instability and drift during repeated applications. The authors identify a geometric mismatch between encoder and decoder manifolds as the primary cause of this failure. To resolve this, they propose a training framework that explicitly aligns the geometry of both components to learn consistent representations of the same underlying data manifold. This alignment encourages stable projections and significantly reduces idempotency error compared to prior methods. Empirical results demonstrate that the approach consistently regenerates identical outputs under repeated application for both image generation and editing tasks. Furthermore, enforcing this type of idempotency improves identity preservation and information stability in generative models.

arxiv arXiv cs.LG · 3d ago

First Finite-Time Analysis of Classical Adam for Nonsmooth Nonconvex Optimization

This study presents the first finite-time convergence analysis for the classical Adam optimizer, specifically addressing its behavior in nonsmooth nonconvex optimization settings. Previous research largely ignored Adam's bias-correction term or required extra algorithmic modifications like clipping, leaving the original method's guarantees unclear. The authors utilize the Online-to-Nonconvex Conversion framework to prove that a randomly scaled learning rate ensures a convergence rate of $1/T^{ rac{2}{13}}$. This theoretical result is significant because it applies to the modern heavy-tailed noise regime, which more closely reflects practical training conditions. Furthermore, the analysis establishes convergence under the parameter choice where $β_1=β_2$, aligning with recent empirical observations. These findings provide a rigorous explanation for Adam's effectiveness in real-world scenarios that were previously inadequately captured by smooth optimization theories.

arxiv arXiv cs.LG · 3d ago

Attention Sinks and Collapse Are Universal Consequences of Content-Based Routing

The study demonstrates that attention sinks, representation collapse, and norm stratification are not unique to transformer architectures but are inherent consequences of content-based routing under a fixed similarity metric. It establishes an identity showing softmax attention functions as Boltzmann-weighted aggregation over Euclidean distances with constant key norms, rendering it blind to key magnitude due to the omission of a specific norm term. This framework predicts that any router utilizing a metric ill-matched to its representations will compensate by concentrating routing and collapsing the routed representations. The authors validate this hypothesis across diverse models including nine pretrained transformers, graph attention networks, selective state-space models, recurrent mixers, and learned residual layers. Experimental results confirm that all tested architectures exhibit this identical signature of collapse regardless of their specific domain or structure. Furthermore, within-model ablations isolate the routing mechanism as the primary cause rather than incidental training dynamics. The onset of this phenomenon is shown to be contingent on the strength of the positional brake accompanying the content score, which can shift the effect across its range. However, the underlying mechanism remains invariant and does not require norm stratification, as routers with norm-normalized keys exhibit the same concentration behavior.

arxiv arXiv cs.CL · 3d ago

Multi-Step Tool-Use RL Collapse and Supervisory Fixes

Recent agentic reinforcement learning methods for large language models often suffer from instability or limited gains in tool-use tasks. Experiments reveal that some models experience catastrophic collapse, where performance drops abruptly and tool-invocation structures fail. Analysis shows these failures stem from unexpected probability spikes in specific control tokens that disrupt structured execution. Despite this disruption, the underlying tool-use capability remains intact but is obscured by specific formatting issues. To address this, the study investigates diverse supervisory signals including off-policy supervision and hint-based guidance under various training schemes. The authors find that interleaving supervised fine-tuning with reinforcement learning substantially improves stability during training. However, this approach exhibits degraded performance when evaluated on format and content out-of-distribution data. The results highlight the importance of understanding RL failures to enable robust training for complex multi-step tool-use tasks.

arxiv arXiv cs.CL · 3d ago

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

A study identifies 'natural ungrokking,' a phenomenon where small language models lose learned grammatical rules midway through pretraining despite the evidence remaining in the data. Researchers observed that a model learning pronoun-gender agreement with Sue collapsed from 0.94 accuracy to near zero by step 3,500 without any corresponding spike in the loss curve. The survival of these rules is determined by support frequency within the training stream, while the data-to-parameter ratio only modulates the depth of the collapse. This emergence-then-collapse dynamic was replicated across multiple corpora, budgets, and seeds, and confirmed in public Pythia checkpoints where collapse depth correlated with model scale. The forgetting process acts as a displacement mechanism where a competing surface pattern out-competes the rule, causing the log-probability margin to cross zero within 100 steps of behavioral failure. Control over this fate is asymmetric; while injecting counter-evidence can destroy rules via a monotone dose-response, restoring support even at 450 times the sustaining level fails to recover them.

arxiv arXiv cs.CL · 3d ago

iLLaDA: An 8B Masked Diffusion Language Model with Fully Bidirectional Attention

The authors introduce iLLaDA, an 8B parameter masked diffusion language model trained from scratch using fully bidirectional attention. This approach contrasts with the predominant autoregressive factorization and causal attention used in modern large language models. The model's pre-training scaled to 12 trillion tokens, followed by supervised fine-tuning on a 25 billion-token instruction corpus for 12 epochs. iLLaDA maintains the masked diffusion objective throughout both training phases and employs variable-length generation for efficiency. It also introduces confidence-based scoring to enhance performance on multiple-choice evaluation tasks. Benchmark results show significant improvements over its predecessor, LLaDA, including gains of 21.6 points on BBH and 14.9 points on ARC-Challenge for the base model. The instruction-tuned variant achieved increases of 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive nature, iLLaDA remains competitive with Qwen2.5 7B across several metrics.

arxiv arXiv cs.CL · 3d ago

Harness Design and Post-Training in LLM Agents

The article examines how tool harness design impacts the post-training of large language model agents. It argues that while agents are routinely post-trained, the scaffolding determining tool exposure is often treated as a fixed detail. Existing algorithms typically assume static environments, ignoring shifts in tools and tasks during deployment. To address this gap, the authors extended ALFWorld to treat harness design as a controllable dimension. This extension supports evaluation under both task and tool environment shifts. The study systematically analyzes harness influence on post-training in in-distribution and out-of-distribution settings. Results show that harness-aware post-training improves performance and enables robust adaptation to new environments. Conversely, minimal design effort leads to drastic performance drops under strong environmental shifts.

arxiv arXiv cs.CL · 3d ago

BiPACE: Bisimulation-Guided Policy Optimization with Action Counterfactual Estimation for LLM Agents

The authors identify a fundamental state-action credit mismatch in stepwise group-based RL for long-horizon LLM agents. Current estimators suffer from overly fine state partitioning and coarse action averaging, which violates equivalence assumptions for credit assignment. BiPACE is introduced as a drop-in advantage estimator that fixes these issues without adding critics or extra rollouts. It clusters steps by cosine distance in the actor's hidden-state geometry to reduce singleton groups and recenters returns using action-conditioned peer baselines. On ALFWorld with Qwen2.5-7B, BiPACE_Q raises validation success from 90.8 to 97.1±0.9, crossing the 95% threshold on every seed. It also improves performance on Qwen2.5-1.5B and achieves gains on WebShop and TextCraft over GRPO and GiGPO. The method incurs only 11.3% overhead of a single training-step wall time while changing the comparison unit to approximate behavioral equivalence.

arxiv arXiv cs.CL · 3d ago

OPERA: Aligning Open-Ended Reasoning via Objective Perplexity-based Reinforcement Learning

The OPERA framework addresses the instability of applying reinforcement learning to open-ended tasks by replacing external judge models with intrinsic rewards derived from perplexity dynamics. This approach quantifies uncertainty reduction at critical reflective states, eliminating stylistic biases and positional inconsistencies common in LLM-as-a-judge systems. During the cold-start phase, the method utilizes guiding words to synthesize diverse reasoning traces and employs perplexity-prioritized rollouts to identify logically consistent branches. This pipeline generates a large-scale dataset of 20,000 high-quality reasoning trajectories for training. Implementing OPERA on the Qwen3-8B model establishes a new state-of-the-art among open-source models. The system achieves parity with or surpasses proprietary models like Gemini2.5 and MiniMax-M2.5 in specific open-ended tasks. Empirical evaluations confirm the scalability and efficacy of this objective perplexity-based alignment strategy.

media Hugging Face Forums · 3d ago

Niodoo: A Local Runtime for Hidden State Steering of Frozen LLMs

Jason Van Pham has released Niodoo, a local runtime designed to steer frozen large language models through their hidden states. The project aims to correct last-step errors by injecting noise or "physics forces" during inference to break token loops. This approach allows smaller models to improve performance without fine-tuning, targeting specific failure cases like the Llama strawberry prompt benchmark. The system generates its own telemetry tags and utilizes TDA analysis to monitor internal model states for looping behavior. Van Pham developed this tool organically through months of self-directed research and red-teaming, emphasizing reproducible results from pinned hashes. The code is available on GitHub under the repository Ruffian-L/niodoo-hidden-state-steering.

media Hugging Face Forums · 3d ago

Prompt Format Inquiry for Training Unsloth/Phi-3.5-mini-instruct

A user seeks advice on the optimal prompt formatting strategy for training the Phi-3.5-mini-instruct model using Unsloth. The inquiry contrasts maintaining a custom text format against utilizing a standard chat template for dataset preparation. The current implementation employs a function that structures data into '### Input:' and '### Output:' sections, appending an end-of-text token. This approach processes JSON-encoded input and output fields derived from a Hugging Face Dataset object. The provided example illustrates a complex structure involving financial insights, merchant names, dates, and transaction totals. The user intends to deploy the trained model via a custom API and requests guidance on whether to retain this format or switch to a chat template.

arxiv arXiv cs.CL · 3d ago

Space-Efficient Language Generation in the Limit

This study initiates a resource-aware theory of language generation in the limit under space efficiency constraints. A learner observes an adversarial positive stream from a target language K and must output a hallucination-free hypothesis L while omitting at most Δ strings. The research focuses on DFAs with s states over an alphabet of size k as the hypothesis class for memory-bounded learners. In the exponential-space regime, the authors prove that a learner can exactly identify the target language K. Under stricter memory budgets, they present a streaming algorithm using poly(s,k) space that converges to a hypothesis with a generation gap of Δ= O(k^{2s-2}). This learned hypothesis captures every string in K of length at least 2s-1. The results are complemented by a near-matching lower bound derived from communication complexity, showing that achieving Δ≤ k^{(1-ε)s} requires k^{Ω(εs)} memory. These findings reveal a sharp transition between polynomial-space generation and exponential-space exact identification.