Training methods — korshunov.ai

MKAN: Monotonic Kolmogorov-Arnold Networks with Hard Monotonicity

MKAN introduces a Kolmogorov-Arnold Network with hard monotonicity guaranteed for all parameter values, achieved through exponential reparameterization, positive edge weights, and a monotone base activation. It enables standard gradient descent training and provides a representation-cost theorem showing that any feature extractor can be realized with monotone structure at a size no more than twice the original, offering a principled scaling rule for monotone encoders.

arXiv cs.LG · 8d ago
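
The exponential reparameterization mentioned in the MKAN summary can be illustrated with a small sketch. This is not the paper's architecture: it swaps the KAN edge functions for a plain dense layer and assumes `tanh` as the monotone base activation. The names `monotone_layer` and `monotone_net` are hypothetical. The point is only that `exp(raw_w)` is positive for every real `raw_w`, so the network is non-decreasing in its input whatever the parameter values are.

```python
import numpy as np

def monotone_layer(x, raw_w, b):
    """Dense layer whose effective weights exp(raw_w) are positive for any raw_w."""
    w = np.exp(raw_w)            # exponential reparameterization: always > 0
    return np.tanh(x @ w + b)    # tanh is monotone increasing (assumed activation)

def monotone_net(x, params):
    """Stack of monotone layers followed by a sum, which preserves monotonicity."""
    h = x
    for raw_w, b in params:
        h = monotone_layer(h, raw_w, b)
    return h.sum(axis=-1)

rng = np.random.default_rng(0)
# Unconstrained parameters: any real values are allowed, including negative ones.
params = [
    (rng.normal(size=(1, 8)), rng.normal(size=8)),
    (rng.normal(size=(8, 4)), rng.normal(size=4)),
]

xs = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)
ys = monotone_net(xs, params)

# Non-decreasing on the grid, up to a small floating-point tolerance.
is_monotone = bool(np.all(np.diff(ys) >= -1e-9))
```

Because the constraint lives in the parameterization rather than in a penalty term, ordinary gradient descent on `raw_w` and `b` never leaves the monotone family.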

Dimensionality Controls When Modularity Helps in Continual Learning

A modular architecture improves compositional continual learning only in low-dimensional regimes, where the representational subspaces of similar tasks partially align. In high-dimensional regimes, modular and single networks perform similarly, indicating that the benefit of modularity depends on the representational dimensionality induced by the initialization scale.

arXiv cs.LG · 8d ago

KANLib: A Modular and Efficient Kolmogorov-Arnold Network Framework

KANLib introduces a modular, extensible, and computationally efficient framework for Kolmogorov-Arnold Networks. It unifies core concepts from PyKAN, EfficientKAN, and FastKAN, supporting adaptive grid rescaling and fine-grained architectural customization while maintaining PyTorch compatibility. Experiments on the California Housing dataset show KANLib achieves competitive efficiency and reproduces established KAN performance.

arXiv cs.LG · 8d ago

SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs

SoftMoE replaces discrete top-k routing with a differentiable soft top-k LapSum relaxation, enabling gradient-based optimization of expert selection. It learns to allocate expert activation non-uniformly across layers, with later layers activating more experts, while using significantly fewer experts than traditional sparse MoE.

arXiv cs.LG · 8d ago

Differential Privacy in Gaussian Process Posterior Sampling

Gaussian process posterior sampling inherently provides differential privacy due to its intrinsic randomness. Explicit Rényi-DP bounds show that privacy depends on ridge regularisation, with membership-inference attacks confirming the predicted leakage patterns. Adding calibrated GP noise enhances privacy while maintaining utility in downstream tasks.

arXiv cs.LG · 8d ago

C2FL: Clustered Continual Federated Learning under Spatial and Temporal Drift

C2FL is a distributed federated learning approach that enables nodes to self-organize into spatial clusters based on geographic proximity. It addresses temporal drift by combining experience replay with dwell-time-aware adaptive averaging, allowing nodes to maintain updated, region-specific knowledge while adapting to evolving environmental conditions.

arXiv cs.LG · 8d ago

BLITZ: Fast and Calibrated Nonparametric Conditional Independence Test

BLITZ introduces a two-stage regression method for nonparametric conditional independence testing. It first removes broad smooth dependencies using polynomial regression, then applies shallow tree regressions to residualize nonlinear features, enabling accurate and fast testing with improved null calibration compared to existing methods.

arXiv cs.AI · 8d ago
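
The two-stage residualization described in the BLITZ summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: a one-dimensional conditioning variable `z`, a cubic polynomial for the first stage, and quantile-binned means as a simple stand-in for the shallow tree regressions. The helper names `residualize` and `residual_correlation` are hypothetical, and the residual correlation is used here only as a toy statistic, not as BLITZ's calibrated test.

```python
import numpy as np

def residualize(target, z, degree=3, bins=8):
    """Remove the dependence of `target` on `z` in two stages."""
    # Stage 1: polynomial regression captures broad, smooth dependence on z.
    coeffs = np.polyfit(z, target, degree)
    r1 = target - np.polyval(coeffs, z)
    # Stage 2: piecewise-constant fit on the stage-1 residual
    # (quantile-binned means, standing in for a shallow regression tree).
    edges = np.quantile(z, np.linspace(0.0, 1.0, bins + 1))
    idx = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, bins - 1)
    means = np.array([r1[idx == k].mean() for k in range(bins)])
    return r1 - means[idx]

def residual_correlation(x, y, z):
    """Toy statistic: correlation of the residuals of x and y given z."""
    rx, ry = residualize(x, z), residualize(y, z)
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
x = np.sin(z) + 0.5 * rng.normal(size=n)
y = z**2 + 0.5 * rng.normal(size=n)      # x and y are independent given z

stat_null = residual_correlation(x, y, z)      # should be close to zero
stat_alt = residual_correlation(x, y + x, z)   # y + x depends on x given z
```

A real test would turn such a statistic into a p-value with a calibrated null distribution, which is the part the summary credits BLITZ with improving.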

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

STAR introduces a spatio-temporal reward allocation method for text-to-image generation, using attention maps to dynamically assign advantages across denoising steps. It improves semantic alignment, text rendering, and preference optimization in Stable Diffusion 3.5 Medium, achieving 0.9759, 0.9757, and 23.60 on GenEval, OCR, and PickScore, respectively.

arXiv cs.AI · 8d ago

Catastrophic Forgetting is Low-Rank: A Function-Space Theory

A function-space theory reveals that catastrophic forgetting in continual adaptation concentrates in a small number of old-task NTK eigenmodes. In frozen-backbone linear-head PEFT-CL, the forgetting vector is exactly predictable up to numerical precision, with a Kronecker scaling rule for the vulnerable rank.

arXiv cs.AI · 8d ago

Volterra Generative Models Introduce Fractional Noise for Score-Based Generation

Volterra generative models are a continuous-time score-based framework that uses fractional kernels to inject path-dependent noise, avoiding the memoryless noising of traditional diffusion models. The approach employs finite-dimensional Markovian lifts and demonstrates improved generation on MNIST and CIFAR-10, with a bridge sampler enhancing stability for larger models.

arXiv cs.AI · 8d ago

S4oP: Operator-level Pruning for Efficient SSM Deployment

S4oP introduces an incremental, operator-level pruning method for S4 and S4D models, reducing inference cost by up to 70% while maintaining performance. The approach combines structured masking with fine-tuning and jointly tracks accuracy and latency, enabling efficient deployment of SSMs on resource-constrained devices.

arXiv cs.AI · 8d ago

Learning Fair Pareto-Optimal Policies in Multi-Objective Reinforcement Learning

The paper introduces a framework for multi-policy multi-objective reinforcement learning that learns a set of Pareto-optimal policies ensuring fairness across diverse user preferences. It proves fair policies remain within the convex coverage set for concave welfare functions like GGF and proposes three algorithms that incorporate non-stationary and stochastic policies to adapt to historical inequities. Empirical results show these methods effectively learn fair policies across multiple domains.

arXiv cs.AI · 8d ago

Ternary Mamba: Pretrained QAT for Efficient SSM Compression

Ternary Mamba achieves 3.61x compression of Mamba-2 using grouped quantization-aware training from a pretrained checkpoint, reducing memory from 2,687 to 744 MB. It reaches 48.1% zero-shot accuracy with only 102M tokens and 4 GPU-hours, matching Bi-Mamba within 0.9 percentage points, while revealing new instability from learnable quantization scales and error accumulation in recurrence.

arXiv cs.AI · 8d ago
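
As a rough illustration of the Ternary Mamba summary, the sketch below shows one common way to ternarize weights in groups and checks the reported compression ratio. The per-group absolute-mean scale, the group size, and the helper names `ternarize_grouped` and `dequantize` are assumptions for illustration. The summary itself mentions learnable scales, and the paper's exact grouping scheme is not given there.

```python
import numpy as np

def ternarize_grouped(w, group_size=4):
    """Quantize weights to {-1, 0, +1} with one scale per group of weights."""
    groups = w.reshape(-1, group_size)
    # Assumed scale: mean absolute value of each group (not learnable here).
    scale = np.abs(groups).mean(axis=1, keepdims=True) + 1e-8
    q = np.clip(np.round(groups / scale), -1, 1)
    return q.reshape(w.shape), scale

def dequantize(q, scale, group_size=4):
    """Reconstruct approximate weights from ternary codes and group scales."""
    return (q.reshape(-1, group_size) * scale).reshape(q.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
q, scale = ternarize_grouped(w)
w_hat = dequantize(q, scale)

# Compression ratio implied by the memory figures in the summary (MB).
compression = 2687 / 744
```

The memory figures are consistent with the headline number: 2,687 / 744 rounds to 3.61. In quantization-aware training, the rounding step is typically bypassed in the backward pass with a straight-through estimator, which this forward-only sketch does not show.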

Meta-Knowledge Reutilization in Reinforcement Learning

A new framework learns task-level knowledge on a simplified agent and transfers it to heterogeneous agents. It uses Bayesian non-parametric priors and a high-level policy to generate task guidance, with a semantic-magnitude interface and temporal adaptor to align meta-knowledge with embodiment-specific controllers. Experiments show a 94.75% to 99.79% reduction in final-step tracking error and performance comparable to state-of-the-art methods using 23.8% of their interaction data.

arXiv cs.AI · 8d ago

Kolmogorov Regression for Robust Diffusion Policies

A backward Kolmogorov equation lifts diffusion policies to a Cameron-Martin space, replacing stochastic score matching with a deterministic PDE. This approach achieves convergence bounds tied to kernel effective rank, improved trajectory regularity, and a failure detector without rewards, showing 17% higher reward and 67.6% reduced drift on PushT, and 28.4% lower RMSE with perfect bottleneck detection on a manufacturing line. Hamilton-Jacobi theory reduces deadlock events by 96% in simulations.

arXiv cs.AI · 8d ago

FPRM: Fixed-Point Reasoning Model with Adaptive Compute

FPRM is a Transformer-based model that uses fixed-point convergence as an end-to-end halting mechanism in a looped architecture. It adapts compute to task difficulty by leveraging fixed-point reasoning, outperforming baseline models on Sudoku, Maze, state-tracking, and ARC-AGI benchmarks.

arXiv cs.AI · 8d ago
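
The idea of halting a looped block at a fixed point, as in the FPRM summary, can be sketched with a toy recurrence. This is not FPRM itself: the update `tanh(h @ w + x)`, the tolerance, and the function name `looped_forward` are illustrative assumptions. The weight matrix is rescaled to spectral norm 0.5 so that the map is a contraction and the loop is guaranteed to converge.

```python
import numpy as np

def looped_forward(x, w, tol=1e-6, max_steps=100):
    """Apply the same block repeatedly; stop once the state stops changing."""
    h = np.zeros_like(x)
    for step in range(1, max_steps + 1):
        h_next = np.tanh(h @ w + x)             # shared-weight update (toy block)
        if np.linalg.norm(h_next - h) < tol:    # fixed-point halting criterion
            return h_next, step
        h = h_next
    return h, max_steps                          # compute budget exhausted

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
w *= 0.5 / np.linalg.norm(w, 2)   # spectral norm 0.5 makes the update a contraction
x = rng.normal(size=4)

h_star, steps = looped_forward(x, w)
```

The number of iterations returned in `steps` varies with the input, which is the sense in which such a loop adapts compute to the problem at hand.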

Looped World Models Achieve 100x Parameter Efficiency

Looped World Models (LoopWM) introduce a looped architecture that iteratively refines latent environment states using a parameter-shared transformer. This approach achieves up to 100x parameter efficiency over conventional world models by adapting computation depth to each prediction's complexity.

arXiv cs.CL · 8d ago

Negative Token Filtering for Stable Single-Rollout RL

A new approach called negative token filtering enables stable single-rollout training by preventing false penalties on negative samples. The method improves performance on agentic tasks compared to group-based RL techniques, while matching group-based methods on reasoning tasks.

arXiv cs.CL · 8d ago

Expressivity Analysis of Hierarchical Modelling in Deep Transformers

This paper analyzes deep transformer expressiveness using bounded-depth grammars. It constructs transformers with positional attention where model depth scales linearly with grammar depth, and neuron count grows quadratically with production rules. The results support the linear representation hypothesis by showing these models can encode abstract grammatical states in low-dimensional, linearly separable subspaces.