Training methods — korshunov.ai


Variance Reduction in Temporal Difference Learning

Temporal difference (TD) learning reduces variance by aggregating over multiple trajectories. The study shows that the variance of TD estimators is asymptotically bounded above by that of Monte Carlo estimators, and that shorter-horizon updates reduce variance for a fixed number of samples. Direct Advantage Estimation acts as a regression-adjusted control variate and achieves tighter variance bounds than TD in large samples.
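To make the contrast concrete, here is a minimal sketch of the two kinds of regression target being compared: a Monte Carlo return, which depends on every reward in one trajectory, and a one-step TD target, which bootstraps from a value estimate that pools information across trajectories. The function names, the discount `gamma`, and the toy numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mc_returns(rewards, gamma):
    """Discounted Monte Carlo return from each step of a single trajectory."""
    out = np.zeros(len(rewards))
    running = 0.0
    # Accumulate backwards so each entry is r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def td0_targets(rewards, next_values, gamma):
    """One-step TD targets: r_t + gamma * V(s_{t+1}).

    next_values holds the current value estimates of the successor states
    (use 0.0 for a terminal successor); these are assumed to be given.
    """
    return np.asarray(rewards, dtype=float) + gamma * np.asarray(next_values, dtype=float)

# Toy trajectory of three steps (hypothetical numbers).
rewards = [1.0, 1.0, 1.0]
mc = mc_returns(rewards, gamma=0.5)
td = td0_targets(rewards, next_values=[2.0, 2.0, 0.0], gamma=0.5)
```

The Monte Carlo target inherits the noise of all later rewards in the trajectory, while the TD target only carries the noise of one reward plus whatever error is in the value estimate.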

arxiv arXiv cs.CL · 6d ago

Sequential DPO Shows Variable Preference Impact Across Settings

A study of sequential Direct Preference Optimization finds that later training does not uniformly degrade earlier learned preferences. The effect varies by objective relationship, signal strength, and training order, ranging from partial degradation to positive transfer. Pair-level analysis reveals heterogeneous changes, with high-confidence preference pairs sometimes improving despite aggregate metric stability.

arxiv arXiv cs.CL · 6d ago

Bayesian Curriculum Learning on LLM Latent Manifolds

Manifold Bandits introduces Bayesian Manifold Curriculum (BMC), a framework that models problem sampling as a structured bandit problem in LLMs' latent space. BMC organizes tasks into a hierarchical tree and uses Bayesian learning to guide sampling, revealing tradeoffs between learning signal, task diversity, and evaluation relevance. Prioritizing difficulty alone fails to achieve strong downstream performance, underscoring the need for structure and type-aware sampling.

arxiv arXiv cs.CL · 6d ago

Training LLMs for Long-Lifecycle Agents via Cross-Domain Generalization

A new framework trains large language models to acquire a 'Connect the Dots' meta-capability through reinforcement learning with long rollout sequences. The method includes tailored tasks and environments to foster this capability, and it shows strong cross-domain generalization and performance in out-of-distribution settings. Implementations are available at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.

arxiv arXiv cs.CL · 6d ago

Information-Theoretic Analysis of Effective Supervision in Latent Chain-of-Thought

This work identifies a dual collapse in latent reasoning: gradient attenuation and representational drift. It proposes Trajectory and Space Supervision, showing that generative reconstruction preserves information capacity better than geometric compression. The Unified Latent Probe measures mutual information between latent trajectories and reasoning steps, revealing an information-performance binding in reasoning accuracy.

arxiv arXiv cs.CL · 6d ago

HydraHead: Head-Level Hybrid Attention for Long-Context Performance

HydraHead introduces a head-level hybridization of Full and Linear Attention, leveraging interpretability to select retrieval-critical heads and fuse outputs via a scale-normalized module. Trained on 15B tokens, it achieves over 69% improvement over baseline at 512K context length, outperforming layer-wise hybrids and approaching Qwen3.5's performance on long-context tasks.

media r/LocalLLaMA · 6d ago

GLM-5.2 (744B, 2-bit) achieves 7.3 tok/s on 4×3090 with 192GB RAM

GLM-5.2 UD-IQ2_M runs at ~7.3 tokens per second on 4×RTX 3090s with 192GB DDR5 RAM using llama.cpp expert offload. Reducing quantization from IQ2 to IQ1 provided no speed gain, while increasing CPU threads from 6 to 12 improved performance by 22%. Decode is limited by CPU compute, not memory bandwidth, and the offloaded experts must be explicitly distributed across GPUs to avoid out-of-memory errors.

media Latent Space · 7d ago

Why AI Scaling Is a Systems Problem, Not Just a GPU Race

The AI scaling debate overlooks that maximizing model FLOPs utilization (MFU) is more critical than buying more GPUs. Frontier labs like xAI operate at sub-10% MFU, while historical models achieved 21% to 70% MFU, indicating systemic inefficiencies in scheduling, networking, and cluster management. Anjney Midha argues that AI infrastructure must evolve into efficient, aligned, and responsible systems, with 'output maxing' emerging as a new discipline for frontier AI.
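As a rough illustration of what an MFU figure measures, the sketch below divides the FLOPs a training run usefully spends per second by the hardware's theoretical peak. It assumes the common approximation of about 6 FLOPs per parameter per token for dense transformer training (attention FLOPs ignored); the function name and the example numbers are hypothetical and not from the article.

```python
def model_flops_utilization(n_params, tokens_per_second, peak_flops_per_second):
    """Approximate MFU for dense transformer training.

    Assumes ~6 FLOPs per parameter per token (forward plus backward pass),
    which ignores attention and other overheads.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second

# Hypothetical cluster: a 1B-parameter model processing 1,000 tokens/s
# on hardware with a combined peak of 6e13 FLOP/s.
mfu = model_flops_utilization(1e9, 1_000, 6e13)
```

Under this definition, raising throughput on the same hardware raises MFU directly, which is why scheduling and networking inefficiencies show up in the number.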

media r/LocalLLaMA · 7d ago

Does anyone have enough compute to make a distillation dataset from GLM5.2?

A user asks if anyone with sufficient computing resources can create a large distillation dataset of 70-1 million examples from GLM5.2. The goal is to enable better training of smaller models like Qwen3.5, benefiting the broader community.

arxiv arXiv cs.LG · 7d ago

Discriminator-Guided RL Corrects Flow Matching with Data-Aligned Rewards

Discriminator-Guided RL (DRL) uses a pretrained representation space to train a discriminator that separates real data from model-generated samples. Its logit is used as a reward in KL-regularized RL, aligning model outputs with visual and semantic realism without human preferences. DRL improves FID and semantic FD across models like SiT and JiT, and enhances the Pareto frontier between preference and fidelity.

arxiv arXiv cs.LG · 7d ago

Essential Subspace Merging for Multi-Task Learning

Essential Subspace Merging (ESM) reduces inter-task interference by focusing on principal directions of activation shifts. ESM++ extends this with dynamic expert selection via prototype-based routing, enabling efficient, training-free multi-task model merging.

arxiv arXiv cs.LG · 7d ago

Safety Reflection Pretraining for LLMs

Safety Reflection Pretraining inserts short safety reflections into pretraining data to enable self-monitoring in language models. Experiments with 1.7B models on FineWeb-Edu show improved safety accuracy and reduced attack success rates. Results on MedSafetyWorld indicate that the method prevents unsafe behaviors from generalizing from safe data more effectively than data filtering or rewriting.

arxiv arXiv cs.LG · 7d ago

Batch Size Tradeoffs in Stochastic Momentum Methods

Stochastic momentum methods like HB and ASGD show distinct batch-size tradeoffs in compute efficiency and serial runtime. HB maintains SGD-level compute efficiency over a batch-size window up to a factor \sqrt{\kappa} larger than SGD's critical batch size, while ASGD improves small-batch efficiency for rapidly decaying spectra but sacrifices it for larger batches in exchange for reduced serial runtime.
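For readers unfamiliar with the update rule, here is a minimal sketch of a heavy-ball momentum step, assuming "HB" refers to Polyak's heavy-ball method. It uses exact gradients on a toy quadratic with condition number \kappa = 100, so it shows only the update itself and not the paper's stochastic batch-size analysis; the step size and momentum are the classical tuned values for a quadratic, and all names are illustrative.

```python
import numpy as np

def heavy_ball(grad_fn, x0, lr, beta, steps):
    """Heavy-ball iteration: x_{k+1} = x_k - lr * grad(x_k) + beta * (x_k - x_{k-1})."""
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    for _ in range(steps):
        x_next = x - lr * grad_fn(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Toy quadratic f(x) = 0.5 * sum(eigs * x**2) with eigenvalues 1 and 100.
eigs = np.array([1.0, 100.0])
grad = lambda x: eigs * x  # exact (full-batch) gradient, an assumption of this sketch

kappa = eigs.max() / eigs.min()
lr = 4.0 / (np.sqrt(eigs.max()) + np.sqrt(eigs.min())) ** 2
beta = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2

x_final = heavy_ball(grad, [1.0, 1.0], lr, beta, steps=300)
```

In a stochastic setting, `grad_fn` would return a mini-batch gradient, and the batch size would control how much noise enters each step.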

arxiv arXiv cs.LG · 7d ago

AGDN: Solving Traveling Salesman Problem with Anisotropic Graph Diffusion

AGDN introduces a graph neural network framework that addresses topological priors and connectivity loss in TSP. It uses a MixScore transition matrix and anisotropic diffusion to enable efficient information exchange, outperforming existing methods across diverse problem sizes and distributions while maintaining competitive computation time. The implementation is available on GitHub.

arxiv arXiv cs.LG · 7d ago

Decision-Focused RL for EV Charging with Unknown Departure Times

A new decision-focused RL framework jointly trains a forecaster and charging controller to handle unknown EV departure times. By aligning forecast accuracy with downstream decision quality, the method achieves up to 14% higher total reward and a 55% reduction in unsupplied energy compared to standard RL approaches.

arxiv arXiv cs.LG · 7d ago

MAST Enables Selective Unlearning in RLVR-Induced Reasoning

MAST, a mechanism-guided unlearning method, achieves targeted forgetting of RLVR-induced reasoning with minimal collateral damage. On Qwen2.5-Math-1.5B and Qwen3-1.7B-Base, it significantly reduces MATH performance (45/150 to 37/150) while preserving GSM8K accuracy (+0.8 points), with MATH retention changing by only -0.5 points. Results hold across different seeds, objectives, and models, showing superior stability over full-parameter unlearning.

arxiv arXiv cs.LG · 7d ago

STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

STARE addresses policy entropy collapse in GRPO-based reinforcement learning by identifying entropy-critical token subsets via surprisal quantiles and reweighting their advantages. It maintains stable policy entropy across model scales and tasks, outperforming DAPO and other baselines by 4%-8% on AIME24 and AIME25, with consistent exploration-exploitation balance.
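The general idea of selecting tokens by surprisal quantile and reweighting their advantages can be sketched as follows. This is not STARE's actual rule: the quantile threshold, the direction of the reweighting (down-weighting here), and the `scale` factor are all assumptions chosen only to illustrate the mechanism.

```python
import numpy as np

def reweight_advantages(token_logprobs, advantages, quantile=0.8, scale=0.5):
    """Scale the advantages of high-surprisal tokens.

    Surprisal is -log p(token). Tokens whose surprisal is at or above the
    given quantile are treated as the selected subset (an assumption) and
    have their advantages multiplied by `scale`.
    """
    surprisal = -np.asarray(token_logprobs, dtype=float)
    threshold = np.quantile(surprisal, quantile)
    weights = np.where(surprisal >= threshold, scale, 1.0)
    return weights * np.asarray(advantages, dtype=float)

# Hypothetical sequence of five tokens, the last one highly surprising.
logprobs = [-0.1, -0.2, -0.3, -0.4, -5.0]
adv = [1.0, 1.0, 1.0, 1.0, 1.0]
reweighted = reweight_advantages(logprobs, adv)
```

In a GRPO-style setup the advantage is typically shared across a response's tokens, so a per-token weight like this is one way to make the update token-dependent.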

arxiv arXiv cs.LG · 7d ago

Graph Neural Networks Accelerate Algebraic Multigrid Pressure Solver

A graph neural network enhances algebraic multigrid solvers by predicting optimal polynomial coefficients for sparse pseudo-inverse operators. The method reduces V-cycle iterations and achieves wall-clock speedups of 4% to 37% across benchmarks, with robust performance on meshes up to 128 times larger than training data and on unseen industry problems like AirfRANS.

arxiv arXiv cs.LG · 7d ago

Large Language Gibbs for Structured Probabilistic Inference

Large Language Gibbs uses LLM conditional distributions as transition operators for iterative variable resampling. This method enables coherent, order-independent probabilistic inference by achieving a stationary distribution that balances local conditionals, offering a practical alternative to single-pass generation for structured reasoning tasks.
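The underlying mechanism is ordinary Gibbs sampling: each variable is resampled from its conditional given the others, and repeated sweeps converge to a stationary distribution. The sketch below checks this exactly on a toy joint over two binary variables, with hand-written conditionals standing in for the LLM's conditional distributions. The joint table and all names are hypothetical; real LLM conditionals need not be consistent with any single joint, which is the case the paper addresses.

```python
import itertools
import numpy as np

# Toy joint over two binary variables, indexed as joint[x, y] (hypothetical values).
joint = np.array([[0.4, 0.1],
                  [0.2, 0.3]])

def cond_x_given_y(y):
    """p(x | y), derived from the joint."""
    col = joint[:, y]
    return col / col.sum()

def cond_y_given_x(x):
    """p(y | x), derived from the joint."""
    row = joint[x, :]
    return row / row.sum()

states = list(itertools.product([0, 1], repeat=2))

def sweep_kernel():
    """Transition matrix for one sweep: resample x given y, then y given the new x."""
    K = np.zeros((len(states), len(states)))
    for i, (_, y) in enumerate(states):
        for j, (x2, y2) in enumerate(states):
            K[i, j] = cond_x_given_y(y)[x2] * cond_y_given_x(x2)[y2]
    return K

K = sweep_kernel()
pi = np.array([joint[x, y] for x, y in states])

# Start from a point mass on (0, 0) and apply 50 sweeps.
dist = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(50):
    dist = dist @ K
```

Here `pi @ K` equals `pi`, so the joint is stationary under the sweep, and `dist` ends up at the same distribution regardless of the starting state.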

arxiv arXiv cs.LG · 7d ago

NeSyCat Torch: Differentiable Tensor Implementation for Neurosymbolic Learning

NeSyCat Torch provides a differentiable tensor implementation of categorical semantics for neurosymbolic learning, unifying classical, fuzzy, probabilistic, and neural systems under a single inductive truth definition. It outperforms LTN and DeepProbLog in speed and accuracy on MNIST addition, matching DeepStochLog's accuracy while operating within a uniform framework extendable to continuous probability via monad instantiation.