Training methods — korshunov.ai

Pooling GPUs to train a community model

A Reddit user asks whether anyone has successfully pooled volunteer GPUs to train a community model, citing obstacles such as latency and weight poisoning. The post also asks whether any existing distributed volunteer-computing project has actually trained such a model.

arXiv cs.CL · 10d ago

Contrastive-Difference CKA Reveals Concept-Specific Alignment Across LLM Architectures

A training-free diagnostic, contrastive-difference CKA (CKA_Delta), identifies concept-specific structural alignment across language model architectures. It detects geometric convergence and functional transfer across six concept domains, including non-instructional tasks, with significant discrimination where standard CKA fails. Results suggest universality may strengthen with model scale, though further validation is needed.
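
For orientation, standard linear CKA and one possible reading of the "contrastive difference" can be sketched as follows. The linear CKA formula is the usual one. Computing it on differences between activations for concept inputs and matched contrast inputs is an assumption made here for illustration, not the paper's stated definition, and all function names and toy activations are hypothetical.

```python
import math

def center(rows):
    """Subtract the per-feature mean from every row (samples x features)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def cross_sq_norm(a, b):
    """Squared Frobenius norm of A^T B for two samples-x-features matrices."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two representations of the same n samples."""
    x, y = center(x), center(y)
    return cross_sq_norm(x, y) / math.sqrt(cross_sq_norm(x, x) * cross_sq_norm(y, y))

def contrastive_difference(pos, neg):
    """ASSUMPTION: row-wise difference between concept and contrast activations."""
    return [[p - q for p, q in zip(rp, rn)] for rp, rn in zip(pos, neg)]

# Toy activations (hypothetical): 4 paired prompts, model A has 2 features, model B has 3.
a_pos = [[1.0, 0.2], [0.8, 0.1], [1.2, 0.4], [0.9, 0.0]]
a_neg = [[0.1, 0.3], [0.2, 0.1], [0.0, 0.2], [0.3, 0.1]]
b_pos = [[0.5, 1.0, 0.0], [0.4, 0.7, 0.1], [0.6, 1.3, 0.2], [0.3, 0.8, 0.0]]
b_neg = [[0.1, 0.1, 0.1], [0.0, 0.2, 0.0], [0.2, 0.1, 0.1], [0.1, 0.3, 0.1]]

diff_a = contrastive_difference(a_pos, a_neg)
diff_b = contrastive_difference(b_pos, b_neg)
score = linear_cka(diff_a, diff_b)
print(f"CKA on contrastive differences: {score:.3f}")
```

Because CKA compares sample-by-sample similarity structure, the two models may have different feature widths; only the number of paired inputs has to match.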

arXiv cs.CL · 10d ago

Key Properties for Effective Code Interpreter Reasoning

A study identifies extrinsic (crucial tokens) and intrinsic (cognitive behaviors) properties that enhance code interpreter reasoning in large language models. Stronger reasoning models show higher prevalence of verification, backtracking, and backward chaining, with these properties improving performance during inference and training, reducing overthinking and boosting token efficiency.

arXiv cs.CL · 10d ago

DeepRubric: Efficient RL for Deep Research Agents

DeepRubric introduces a data construction framework that builds query-rubric pairs by first defining verifiable evaluation targets through an evidence tree. It generates 9K supervision examples and trains an 8B model with GRPO, achieving performance comparable to state-of-the-art models while using 13x fewer RL GPU-hours.
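
As a rough illustration of how a rubric can feed GRPO, the sketch below scores each sampled response by the fraction of rubric items it satisfies and then normalizes rewards within the group. The group-relative normalization is the standard GRPO advantage. The rubric scoring rule, the use of the population standard deviation, and all names are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev

def rubric_reward(checks):
    """ASSUMPTION: reward is the fraction of rubric items a response satisfies."""
    return sum(checks) / len(checks)

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical responses to one query, each checked against three rubric items.
group_checks = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [False, False, False],
]
rewards = [rubric_reward(checks) for checks in group_checks]
advantages = group_relative_advantages(rewards)
for reward, adv in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={adv:+.2f}")
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason graded rubrics can be more informative than a single pass/fail outcome.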

arXiv cs.AI · 10d ago

MA-SBI: Calibration-Free SBI via Side-Channel Guidance

MA-SBI introduces a calibration-free simulation-based inference framework that uses side-channel text, like regime labels or instructions, to correct for simulator misspecification. It employs a learned corrector to apply observation-space shifts before posterior inference, without needing ground-truth parameter pairs or retraining. On hide-the-calibration benchmarks, MA-SBI matches the oracle posterior with text alone, outperforming RoPE under limited data, and shows robustness on real-world epidemiological and cognitive-science datasets.

arXiv cs.AI · 10d ago

Unified Causal-Origin Taxonomy for Distributional Shifts in RL

This paper introduces a unified causal-origin taxonomy that divides distributional shifts in reinforcement learning into internal (agent-driven) and external (environment-driven) sources. It unifies ID/OOD generalization and non-stationary settings by framing shifts as structured changes in the agent-environment interaction process, using a POMDP decomposition and a shifted-time boundary perspective.

arXiv cs.AI · 10d ago

Low Frame Rate Degradation in Neural Audio Codecs

A quality cliff at 6.25 Hz in neural audio codecs is caused by insufficient training token exposure due to fixed clip duration. Correcting this training configuration enables smooth WER degradation down to 3.1 Hz and 1.6 Hz, indicating low frame rate efficiency is more achievable than previously thought.
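
The arithmetic behind the token-exposure argument is simple: with a fixed clip duration, each halving of the frame rate halves the number of tokens the model sees per clip. The sketch below uses a hypothetical 4-second clip. The summary does not state the clip length or the exact correction the authors apply, so lengthening the clip to hold the frame count constant is an assumption.

```python
def frames_per_clip(frame_rate_hz, clip_seconds):
    """Number of codec frames (tokens per codebook) in one training clip."""
    return frame_rate_hz * clip_seconds

def clip_seconds_for(frame_rate_hz, target_frames):
    """Clip duration needed to keep the per-clip frame count at target_frames."""
    return target_frames / frame_rate_hz

CLIP_SECONDS = 4.0  # hypothetical fixed clip duration
baseline = frames_per_clip(12.5, CLIP_SECONDS)

for rate in (12.5, 6.25, 3.125, 1.5625):
    fixed = frames_per_clip(rate, CLIP_SECONDS)
    needed = clip_seconds_for(rate, baseline)
    print(f"{rate:>7} Hz: {fixed:5.2f} frames per fixed clip, "
          f"{needed:5.1f} s clip to match the 12.5 Hz frame count")
```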

arXiv cs.AI · 10d ago

PACT: Small Language Model Deliberation for Reactive Reinforcement Learning

PACT combines a reactive RL policy with a 2B-parameter Small Language Model to generate and validate action plans. The SLM plan is executed directly if verified as safe, feasible, and complete, bypassing the RL policy. PACT outperforms baselines on three increasingly difficult FrozenLake environments.
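
The plan-or-fallback control flow described here can be sketched on a FrozenLake-style grid. This is a minimal illustration under stated assumptions, not the authors' implementation: the map is the familiar 4x4 layout, movement is assumed deterministic (no slipping), and `verify_plan`, `choose_actions`, and the stand-in policy are hypothetical names.

```python
GRID = ["SFFF", "FHFH", "FFFH", "HFFG"]  # S start, F frozen, H hole, G goal
MOVES = {"L": (0, -1), "D": (1, 0), "R": (0, 1), "U": (-1, 0)}

def verify_plan(grid, plan, start=(0, 0)):
    """Simulate a plan: feasible (stays on the grid), safe (no hole), complete (ends on G)."""
    row, col = start
    for action in plan:
        d_row, d_col = MOVES[action]
        row, col = row + d_row, col + d_col
        if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
            return False  # infeasible: walks off the grid
        if grid[row][col] == "H":
            return False  # unsafe: falls into a hole
    return grid[row][col] == "G"  # complete only if the plan ends at the goal

def choose_actions(grid, slm_plan, rl_policy, start=(0, 0)):
    """Execute the SLM plan if it verifies; otherwise fall back to one reactive RL step."""
    if slm_plan and verify_plan(grid, slm_plan, start):
        return "plan", list(slm_plan)
    return "policy", [rl_policy(start)]

def always_down(state):
    """Stand-in for a trained reactive policy (hypothetical)."""
    return "D"

print(choose_actions(GRID, "DDRRDR", always_down))  # plan verifies and bypasses the policy
print(choose_actions(GRID, "DR", always_down))      # plan hits a hole, so the policy acts
```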

arXiv cs.LG · 10d ago

Hyperball Optimization for Faster Language Model Training

Hyperball is a simple optimizer wrapper that sets fixed Frobenius norms for weight matrices and their updates. It improves training speed and learning rate transfer in large models, achieving a 20-30% token-equivalent speedup over weight decay baselines on models of up to 1.2B parameters.
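
One way to read "fixed Frobenius norms for weight matrices and their updates" is the two-step projection sketched below: rescale the base optimizer's proposed update to a fixed norm, apply it, then rescale the weight matrix back to its own fixed norm. This is an interpretation of the one-sentence summary, not the authors' algorithm, and the function names and the way the two norms are chosen are assumptions.

```python
import math

def fro_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(v * v for row in matrix for v in row))

def rescale(matrix, target_norm):
    """Return a copy of the matrix scaled to the target Frobenius norm."""
    norm = fro_norm(matrix)
    if norm == 0.0:
        return [row[:] for row in matrix]  # nothing to scale
    factor = target_norm / norm
    return [[v * factor for v in row] for row in matrix]

def hyperball_style_step(weight, update, weight_norm, update_norm):
    """ASSUMED wrapper: fixed-norm update, then project the weights back to a fixed norm."""
    step = rescale(update, update_norm)
    moved = [[w - u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(weight, step)]
    return rescale(moved, weight_norm)

weight = [[3.0, 4.0], [0.0, 0.0]]   # Frobenius norm 5
update = [[1.0, 0.0], [0.0, 1.0]]   # e.g. a direction proposed by the base optimizer
new_weight = hyperball_style_step(weight, update, weight_norm=5.0, update_norm=0.1)
print(new_weight, fro_norm(new_weight))
```

Under this reading, the update norm plays the role of the learning rate and the fixed weight norm stands in for weight decay, since the weights can no longer grow or shrink in scale.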

arXiv cs.LG · 10d ago

Factorized Neural Operators Decompose Dynamic and Persistent Responses

Factorized Neural Operators (FaNO) decompose spectral representations into equivariant dynamic and invariant persistent responses. This factorized structure enables better interpretability, generalization, and consistent predictions across scales, domains, and physical regimes.

arXiv cs.LG · 10d ago

Adaptive Functional Gradient Descent with Convergence Guarantees

The paper proposes a new functional gradient descent (FGD) algorithm that adapts its representation during optimization. Despite using finite-dimensional approximations, the method converges to a stationary point under smooth losses and to a global minimizer under smoothness plus a Polyak-Łojasiewicz condition. It outperforms both fixed-approximation FGD and neural network baselines in regression, PDE solving, and computer vision tasks.
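
For reference, the Polyak-Łojasiewicz (PL) condition and the rate it gives for plain gradient descent are shown below in their textbook finite-dimensional form. The paper works in a functional setting, so its precise assumptions and constants may differ; the symbols $f$, $f^\ast$, $\mu$, and $L$ are introduced here for illustration.

```latex
% PL condition with constant \mu > 0, where f^\ast = \inf_x f(x):
\frac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu \,\bigl(f(x) - f^\ast\bigr)
\quad \text{for all } x .

% If f is also L-smooth, gradient descent with step size 1/L,
% x_{k+1} = x_k - \tfrac{1}{L}\nabla f(x_k), satisfies
f(x_{k+1}) - f^\ast \;\le\; \Bigl(1 - \frac{\mu}{L}\Bigr)\bigl(f(x_k) - f^\ast\bigr).
```

The condition does not require convexity, which is why it can yield convergence to a global minimizer where smoothness alone only guarantees a stationary point.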

arXiv cs.LG · 10d ago

Unified Causal-Origin Taxonomy of Distributional Shifts in RL

This paper proposes a unified causal-origin taxonomy for distributional shifts in reinforcement learning, linking ID/OOD generalization to non-stationary settings. It decomposes the agent-environment interaction using a POMDP framework and identifies internal (agent-driven) and external (environment-driven) shifts, with explicit, implicit, and hybrid types defined by the shifted-time boundary. The work also introduces an evaluation framework that measures shift impact through performance degradation and recovery metrics, enabling systematic analysis of RL robustness.

arXiv cs.LG · 10d ago

A nonparametric two-sample test using PReLU-IPM

The study introduces PReLU-IPM, a new integral probability metric based on a neural network discriminator with a single node. The resulting PReLU-TST test is nonparametric, consistent, and asymptotically equivalent to standard IPM-based tests, showing higher power or competitive performance on simulated and real datasets.

arXiv cs.LG · 10d ago

SPaiK: Scalable Pairwise Kernel Learning with Stochastic Vec Trick

SPaiK introduces a scalable kernel learning method for pairwise settings using the stochastic generalized vec trick (sGVT). This innovation reduces computational and memory demands, enabling efficient training on large datasets and making pairwise kernel learning feasible for previously intractable data sizes.

arXiv cs.LG · 10d ago

PACT: Small Language Model Deliberation for Reactive Reinforcement Learning

PACT combines a reactive RL policy with a 2B-parameter Small Language Model to generate and validate action plans. The SLM plan is executed directly if verified in simulation, bypassing the RL policy without retraining. PACT outperforms baselines on three increasingly difficult FrozenLake environments.

arXiv cs.LG · 10d ago

Post-Hoc Falsification Operators Fail to Improve Accuracy in Small Code Models

A measurement study finds that 26 semantic post-hoc operators do not improve held-out accuracy over Best-of-N in frozen small code models. While some operators reduce compute usage or recover correct programs, none outperform BoN in accuracy, due to systemic limitations like coverage walls and consensus traps. An expression-layer recovery (M1) improves performance on HumanEval+ by 12 tasks, with no harm or leakage, and shows consistent results across model cells.

arXiv cs.LG · 10d ago

Residual Connections Mitigate Gradient Issues in Deep Networks

A study uses multiplicative ergodic theory to analyze exploding and vanishing gradients in deep neural networks. It shows that residual connections affect the Lyapunov spectrum, as characterized by Furstenberg and Kifer, thereby stabilizing gradient flow during training.
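
To make the connection concrete, the top Lyapunov exponent of a product of layer Jacobians measures the average per-layer log growth of a backpropagated vector: a positive value means exploding gradients, a negative value means vanishing ones. The sketch below estimates it numerically for random linear layers with and without an identity skip. It is a toy experiment under stated assumptions (Gaussian weights, no nonlinearity, hypothetical names and sizes), not the paper's analysis.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def top_lyapunov(jacobians, dim):
    """Estimate the top Lyapunov exponent by tracking renormalized vector growth."""
    vector = [1.0 / math.sqrt(dim)] * dim
    log_growth = 0.0
    for jacobian in jacobians:
        vector = matvec(jacobian, vector)
        norm = math.sqrt(sum(x * x for x in vector))
        log_growth += math.log(norm)
        vector = [x / norm for x in vector]  # renormalize to avoid overflow or underflow
    return log_growth / len(jacobians)

def random_layer(rng, dim, scale, residual):
    """Jacobian of a linear layer W, or of x + W x when residual is True (assumed model)."""
    layer = [[rng.gauss(0.0, scale / math.sqrt(dim)) for _ in range(dim)] for _ in range(dim)]
    if residual:
        for i in range(dim):
            layer[i][i] += 1.0
    return layer

DIM, DEPTH, SCALE = 16, 200, 0.5  # hypothetical sizes
rng = random.Random(0)
plain = top_lyapunov([random_layer(rng, DIM, SCALE, False) for _ in range(DEPTH)], DIM)
skip = top_lyapunov([random_layer(rng, DIM, SCALE, True) for _ in range(DEPTH)], DIM)
print(f"plain layers:    {plain:+.3f}")
print(f"residual layers: {skip:+.3f}")
```

Varying `SCALE` shows how far each variant's exponent sits from zero, which is the quantity the stability argument turns on.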

arXiv cs.LG · 10d ago

ExpRL: Exploratory RL for LLM Mid-Training

ExpRL introduces a novel mid-training approach for LLMs using human-written question-answer data as reward scaffolds. Instead of imitating reference solutions, it constructs problem-specific grading rubrics to reward intermediate reasoning steps, enabling better initialization for sparse-reward RL and outperforming SFT, sparse-reward GRPO, and self-distillation on math reasoning tasks.

arXiv cs.LG · 10d ago

HABC Improves RL Fine-Tuning of VLAs with Sparse Outcomes

Hierarchical Advantage-Weighted Behavior Cloning (HABC) enhances online RL fine-tuning of vision-language-action (VLA) models by using separate critic heads for viability and efficiency. It combines their outputs via a state-adaptive gate and applies per-transition weights, while intervention-aware credit assignment prevents supervision leakage. In real-robot experiments, HABC raises success rates to 92%, 88%, and 38% on three bimanual tasks, compared with SFT baselines of 36%, 44%, and 12%.