Training methods — korshunov.ai

Off-Policy Evaluation for MNAR Rewards in MDPs

We propose an off-policy evaluation method for finite-horizon MDPs with rewards missing not at random. Our approach uses a reward-dependent propensity model and a bridge function to recover conditional mean rewards without modeling the MNAR mechanism, achieving consistency and finite-sample error bounds. Experiments on simulated and MIMIC-III Sepsis data show superior performance over existing methods.

arxiv arXiv cs.LG · 6d ago

Boundary Embedding Shaping for Graph Structural Disentanglement

Boundary Embedding Shaping (BES) addresses graph structural entanglement by selectively suppressing spurious neighbor correlations near class boundaries. BES uses adaptive contrastive learning to enhance boundary discrimination, improving GCN node classification by an average of 3.3% (up to 5.0% on WikiCS) and achieving superior link prediction accuracy.

arxiv arXiv cs.LG · 6d ago

SLiR: Shifting-based Linear Relaxations for Activation Functions

SLiR enables sound, tight linear relaxations of general activation functions using only Lipschitz constants or critical points. It verifies up to 7.8x more properties than state-of-the-art methods by efficiently computing upper and lower bounds via a shifting procedure.
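To make the idea of a sound linear relaxation concrete, here is a minimal sketch that bounds an activation between two parallel lines using only a Lipschitz constant. It is a generic grid-plus-slack construction, not SLiR's actual shifting procedure; the function name `lipschitz_linear_bounds`, the chord slope, and the cell count are assumptions made for illustration.

```python
import math

def lipschitz_linear_bounds(f, lo, hi, lipschitz, cells=1000):
    """Return (slope, b_low, b_up) with slope*x + b_low <= f(x) <= slope*x + b_up
    on [lo, hi], assuming f is Lipschitz with the given constant.

    Illustrative only: takes the chord slope, then shifts the line down and up.
    """
    slope = (f(hi) - f(lo)) / (hi - lo)
    h = (hi - lo) / cells
    # g(x) = f(x) - slope*x is Lipschitz with constant (lipschitz + |slope|),
    # so within a cell g differs from its midpoint value by at most this slack.
    slack = (lipschitz + abs(slope)) * h / 2.0
    residuals = [f(x) - slope * x
                 for x in (lo + (i + 0.5) * h for i in range(cells))]
    return slope, min(residuals) - slack, max(residuals) + slack

# tanh has Lipschitz constant 1; the interval is a hypothetical pre-activation range.
slope, b_low, b_up = lipschitz_linear_bounds(math.tanh, -1.0, 2.0, lipschitz=1.0)
print(slope, b_low, b_up)
```

More cells shrink the slack term and tighten the two bounds, at the cost of more function evaluations.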

arxiv arXiv cs.LG · 6d ago

Statistical Properties of Training and Generalization

The article examines deep learning's deviation from classical statistical intuitions, emphasizing neural scaling laws and their interaction with physical constraints and inductive biases in machine learning applications.

arxiv arXiv cs.LG · 6d ago

Model-Driven Approach for RL Environment Families

A model-driven approach generates families of reinforcement learning environments using a hybrid genetic algorithm. Environment variants are created through model transformations guided by a state-of-the-art model transformation engine, enabling scalable and error-resistant development. The method is validated in wildfire mitigation and curriculum learning scenarios.

arxiv arXiv cs.LG · 6d ago

Recurrent neural networks approximate continuous functions

A single ReLU recurrent neural network with fixed weights and hidden dimension can uniformly approximate any continuous function on [-1,1] as its runtime increases. This is achieved via a new model, the Turing machine with neural units (TMNU), which balances algorithmic flexibility with bounded simulation by RNNs. The convergence rates match polynomial approximation rates, and minimax lower bounds confirm that runtime is an essential, unavoidable resource.

arxiv arXiv cs.LG · 6d ago

QCPIKAN: Quantum-Classical Physics-Informed KAN for PDEs

QCPIKAN is the first quantum-classical physics-informed Kolmogorov-Arnold network designed to solve partial differential equations. It uses Chebyshev-polynomial KAN layers and parameterized quantum circuits to embed physical constraints into training, achieving exponential error convergence and reduced numerical dispersion. Validated on seepage scenarios in porous media, it outperforms existing quantum-classical neural networks in prediction accuracy, error control, and dynamic tracking.

arxiv arXiv cs.LG · 6d ago

Quantum Ring All-Reduce: Communication and Privacy Advantages for Distributed Learning

A quantum version of ring all-reduce reduces per-link communication by a factor of two using entanglement and superdense coding, without altering model or gradient computations. It achieves information-theoretically secure aggregation via verified entanglement, with a 2x overhead in GHZ copies, and provides exponential communication advantages in gradient conflict detection for specific auditing tasks.
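As a point of comparison, the classical ring all-reduce that this work builds on can be simulated in a few lines. The sketch below covers only the classical baseline (reduce-scatter followed by all-gather) and counts chunk-sized messages per link; it does not model entanglement or superdense coding, and the function name and data layout are illustrative assumptions.

```python
def ring_all_reduce(worker_chunks):
    """Simulate classical ring all-reduce (sum) over n workers.

    worker_chunks[i][c] is worker i's value for chunk c (n workers, n chunks).
    Returns (per-worker results, chunk-sized messages sent on each link).
    """
    n = len(worker_chunks)
    data = [list(row) for row in worker_chunks]
    sends_per_link = 0

    # Reduce-scatter: in each step, worker i sends one chunk to worker i+1,
    # which adds it to its own copy. Messages are snapshotted first so that
    # all sends within a step happen "simultaneously".
    for step in range(n - 1):
        outgoing = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in outgoing:
            data[(src + 1) % n][chunk] += value
        sends_per_link += 1

    # Worker i now holds the full sum for chunk (i + 1) % n.
    # All-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        outgoing = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in outgoing:
            data[(src + 1) % n][chunk] = value
        sends_per_link += 1

    return data, sends_per_link

result, sends = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result, sends)
```

Each link carries 2(n-1) chunks of size 1/n of the gradient, which is the per-link volume that the summary says the quantum variant halves.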

arxiv arXiv cs.LG · 6d ago

Variance Reduction in Temporal Difference Learning

Temporal difference learning reduces variance by aggregating over multiple trajectories. The study shows that the variance of TD estimators is asymptotically bounded above by that of Monte Carlo estimators, and that shorter-horizon updates reduce variance for a fixed number of samples. Direct Advantage Estimation acts as a regression-adjusted control variate, achieving tighter variance bounds than TD in large samples.
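For readers less familiar with the two estimators being compared, the sketch below contrasts a Monte Carlo return with an n-step TD target on a single trajectory. These are the textbook definitions; the estimators analyzed in the study may differ in detail, and the function names and toy numbers are assumptions.

```python
def mc_return(rewards, gamma):
    """Monte Carlo return: discounted sum of all rewards to the episode's end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def n_step_td_target(rewards, values, t, n, gamma):
    """n-step TD target from time t.

    Sums n discounted rewards, then bootstraps from the value estimate of the
    state reached after n steps. values[k] is the estimate for state k; the
    terminal state is treated as having value 0.
    """
    horizon = len(rewards)
    end = min(t + n, horizon)
    target = 0.0
    for k in range(t, end):
        target += gamma ** (k - t) * rewards[k]
    if end < horizon:
        target += gamma ** (end - t) * values[end]
    return target

# Toy trajectory; value estimates are hypothetical.
rewards = [1.0, 1.0, 1.0]
values = [0.0, 2.0, 0.0]
print(mc_return(rewards, 0.5))
print(n_step_td_target(rewards, values, t=0, n=1, gamma=0.5))
```

A shorter horizon `n` replaces more of the sampled rewards with a learned value estimate, which is the mechanism behind the variance reduction described above; with `n` equal to the episode length the target reduces to the Monte Carlo return.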

arxiv arXiv cs.CL · 6d ago

Sequential DPO Shows Variable Preference Impact Across Settings

A study of sequential Direct Preference Optimization finds that later training does not uniformly degrade earlier learned preferences. The effect varies by objective relationship, signal strength, and training order, ranging from partial degradation to positive transfer. Pair-level analysis reveals heterogeneous changes, with high-confidence preference pairs sometimes improving despite aggregate metric stability.
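For context, the per-pair objective behind Direct Preference Optimization can be written in a few lines. This is a minimal sketch of the standard DPO loss on one preference pair, not the study's training code; the function name, the default `beta`, and the example log-probabilities are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and a
    frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities for one pair, before and after a later training stage.
before = dpo_loss(-12.0, -11.0, -12.5, -10.5)
after = dpo_loss(-11.0, -12.0, -12.5, -10.5)
print(before, after)
```

Tracking this loss (or the underlying margin) for each pair across training stages is one way to carry out the kind of pair-level analysis the summary mentions.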

arxiv arXiv cs.CL · 6d ago

Bayesian Curriculum Learning on LLM Latent Manifolds

Manifold Bandits introduces Bayesian Manifold Curriculum (BMC), a framework that models problem sampling as a structured bandit problem in LLMs' latent space. BMC organizes tasks into a hierarchical tree and uses Bayesian learning to guide sampling, revealing tradeoffs between learning signal, task diversity, and evaluation relevance. Prioritizing difficulty alone fails to achieve strong downstream performance, underscoring the need for structure-aware and type-aware sampling.

arxiv arXiv cs.CL · 6d ago

Training LLMs for Long-Lifecycle Agents via Cross-Domain Generalization

A new framework trains large language models to 'Connect the Dots' using reinforcement learning with long rollout sequences. The method includes tailored tasks and environments to foster meta-capability development, and shows strong cross-domain generalization and performance in out-of-distribution settings. Implementations are available at https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod.

arxiv arXiv cs.CL · 6d ago

Information-Theoretic Analysis of Effective Supervision in Latent Chain-of-Thought

This work identifies a dual collapse in latent reasoning: gradient attenuation and representational drift. It proposes Trajectory and Space Supervision, showing that generative reconstruction preserves information capacity better than geometric compression. The Unified Latent Probe measures mutual information between latent trajectories and reasoning steps, revealing an information-performance binding in reasoning accuracy.

arxiv arXiv cs.CL · 6d ago

HydraHead: Head-Level Hybrid Attention for Long-Context Performance

HydraHead introduces a head-level hybridization of Full and Linear Attention, leveraging interpretability to select retrieval-critical heads and fuse outputs via a scale-normalized module. Trained on 15B tokens, it achieves over 69% improvement over baseline at 512K context length, outperforming layer-wise hybrids and approaching Qwen3.5's performance on long-context tasks.

media r/LocalLLaMA · 7d ago

GLM-5.2 (744B, 2-bit) achieves 7.3 tok/s on 4×3090 with 192GB RAM

GLM-5.2 UD-IQ2_M runs at ~7.3 tokens per second on 4×RTX 3090s with 192GB DDR5 RAM using llama.cpp expert offload. Reducing quantization from IQ2 to IQ1 provided no speed gain, while increasing CPU threads from 6 to 12 improved performance by 22%. Decode is limited by CPU compute, not memory bandwidth, and the offloaded experts must be explicitly distributed across GPUs to avoid out-of-memory errors.

media Latent Space · 7d ago

Why AI Scaling Is a Systems Problem, Not Just a GPU Race

The AI scaling debate overlooks that maximizing model FLOPs utilization (MFU) matters more than buying more GPUs. Frontier labs like xAI operate at sub-10% MFU, while historical models achieved 21% to 70% MFU, indicating systemic inefficiencies in scheduling, networking, and cluster management. Anjney Midha argues that AI infrastructure must evolve into efficient, aligned, and responsible systems, with 'output maxing' emerging as a new discipline for frontier AI.
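For reference, MFU is usually computed as the model FLOPs actually achieved per second divided by the hardware's theoretical peak. A minimal sketch, assuming the common rule of thumb of about 6 FLOPs per parameter per training token for dense transformers; the function name and all numbers below are hypothetical and not taken from the article.

```python
def model_flops_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Estimate MFU for dense-transformer training.

    Assumes ~6 FLOPs per parameter per token (forward plus backward pass),
    an approximation that ignores attention FLOPs.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_second / peak_flops_per_second

# Hypothetical cluster, for illustration only.
mfu = model_flops_utilization(
    n_params=70e9,
    tokens_per_second=400_000,
    n_gpus=1024,
    peak_flops_per_gpu=1e15,
)
print(f"MFU: {mfu:.1%}")
```

Because the denominator scales with GPU count, adding hardware without raising throughput proportionally lowers MFU, which is the systems argument the piece makes.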

media r/LocalLLaMA · 7d ago

Does anyone have enough compute to make a distillation dataset from GLM5.2?

A user asks if anyone with sufficient computing resources can create a large distillation dataset of 70-1 million examples from GLM5.2. The goal is to enable better training of smaller models like Qwen3.5, benefiting the broader community.

arxiv arXiv cs.LG · 7d ago

Discriminator-Guided RL Corrects Flow Matching with Data-Aligned Rewards

Discriminator-Guided RL (DRL) uses a pretrained representation space to train a discriminator that separates real data from model-generated samples. Its logit is used as a reward in KL-regularized RL, aligning model outputs with visual and semantic realism without human preferences. DRL improves FID and semantic FD across models like SiT and JiT, and enhances the Pareto frontier between preference and fidelity.

arxiv arXiv cs.LG · 7d ago

Essential Subspace Merging for Multi-Task Learning

Essential Subspace Merging (ESM) reduces inter-task interference by focusing on principal directions of activation shifts. ESM++ extends this with dynamic expert selection via prototype-based routing, enabling efficient, training-free multi-task model merging.

arxiv arXiv cs.LG · 7d ago

Safety Reflection Pretraining for LLMs

Safety Reflection Pretraining inserts short safety reflections into pretraining data to enable self-monitoring in language models. Experiments with 1.7B models on FineWeb-Edu show improved safety accuracy and reduced attack success rates. On MedSafetyWorld, the method is more effective than data filtering or rewriting at keeping models from generalizing unsafe behaviors from safe data.