Training methods — korshunov.ai

UFP4: Uniform 4-Bit Training Overcomes Shrinkage Bias in LLM Pretraining

A study identifies a shrinkage bias in E2M1-based FP4 formats that stems from the geometric asymmetry of the E2M1 grid and leads to multiplicative error accumulation and training instability. The proposed UFP4 recipe uses uniform E1M2/INT4 grids and applies a Random Hadamard Transform to all GEMMs, achieving lower loss degradation than E2M1 baselines in large-scale LLM pretraining. The authors recommend E1M2/INT4 as a first-class training primitive for future accelerators.

arXiv cs.AI · 6d ago
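
As a rough illustration of the two ingredients named in the UFP4 summary (a sketch, not the authors' code), the Python below lists the magnitudes a 4-bit E2M1 value can represent next to an E1M2 grid, and builds a randomized Hadamard rotation Q = HD/sqrt(n). The E1M2 exponent bias of 1 and all helper names are assumptions made for this example. Because Q is orthogonal, (AQ)(QᵀB) equals AB before any quantization, which is why such a rotation can be applied to both GEMM operands.

```python
import random

# Non-negative magnitudes per format (the sign bit is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # spacing widens with magnitude
E1M2_GRID = [0.25 * k for k in range(8)]               # uniform step; bias of 1 is assumed


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def random_hadamard(n, rng):
    """Return Q = H D / sqrt(n), where D is a random +/-1 diagonal. Q is orthogonal."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    scale = n ** -0.5
    return [[v * signs[j] * scale for j, v in enumerate(row)] for row in hadamard(n)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def rotated_gemm(a, b, q):
    """Compute (A Q)(Q^T B); a real recipe would quantize both rotated operands first."""
    return matmul(matmul(a, q), matmul(transpose(q), b))
```

The rotation spreads outliers across a row before rounding; the quantization step itself is left out here because the summary does not specify it.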

Repurposing Speech Classifier for Diffusion-Based Generation

A pretrained speech classifier is repurposed as a backbone for guided diffusion-based speech generation. By attaching a lightweight subnetwork and training it under denoising score matching, the approach achieves high speech quality with reduced memory and computational cost, using a single model instead of two separately trained components.

arXiv cs.AI · 6d ago

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant enables 4-bit KV caching for context-heavy agents, reducing P50 time-to-first-token by 3.47x in late rounds and boosting output throughput by 1.63x over an FP8 KV baseline. It does so with FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA on AMD CDNA4 GPUs, combined with optimized decode-attention kernels and design choices such as asymmetric K/V treatment and Walsh-Hadamard rotation.

arXiv cs.AI · 6d ago
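
To show what "FP4 values with UE8M0 group scales" can mean in practice, here is a minimal Python sketch under stated assumptions: each group shares one power-of-two scale (UE8M0 stores an exponent only), and each element is rounded to the nearest E2M1 magnitude. The scale-selection rule, the function names, and the group contents are illustrative choices for this example; UltraQuant's actual kernels, asymmetric K/V handling, and rotation are not modeled.

```python
import math

# Magnitudes representable by a 4-bit E2M1 value (sign stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_group(values):
    """Return (exponent, codes) so that value ~= code * 2**exponent.

    The exponent is chosen (an assumption of this sketch) as the smallest
    power of two that maps the largest magnitude into the E2M1 range [0, 6].
    """
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0, [0.0] * len(values)
    exponent = math.ceil(math.log2(amax / 6.0))
    scale = 2.0 ** exponent
    codes = []
    for v in values:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(math.copysign(nearest, v))
    return exponent, codes


def dequantize_group(exponent, codes):
    """Reconstruct approximate values from the shared exponent and FP4 codes."""
    scale = 2.0 ** exponent
    return [c * scale for c in codes]
```

For the group `[0.1, -0.7, 2.4, -3.0]` the shared exponent is -1 (scale 0.5) and the reconstruction is `[0.0, -0.75, 2.0, -3.0]`, which shows how the coarse spacing near the top of the E2M1 grid dominates the error.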

Calibration in MoE Models Under Distribution Shift

This paper examines how mixture-of-experts models maintain calibration under distribution shift. It finds that expert-level calibration ensures overall model calibration in hard-routed models but is insufficient for soft-routed models. The authors propose adversarial reweighting to penalize calibration errors in routed aggregates, improving the accuracy-calibration tradeoff across tasks and shifts.

arXiv cs.LG · 6d ago

Direct Advantage Estimation for Partially Observable Domains

Direct Advantage Estimation (DAE) is extended to partially observable domains with minimal modifications. A discrete latent dynamics model reduces computational overhead by efficiently approximating transition probabilities, enabling scalable and sample-efficient deep reinforcement learning in high-dimensional observation spaces.

arXiv cs.LG · 6d ago

Timestep Embeddings Unnecessary in Diffusion Models

A study shows that diffusion models can reach global minimizers without explicit timestep embeddings. Ablation studies on CelebA and CIFAR-10 show that time-agnostic models maintain high fidelity and outperform timestep-conditioned ones in FID, precision, and recall.

arXiv cs.LG · 6d ago

DeepGaLA: Neural Surrogates with Uncertainty for PDE Inverse Problems

DeepGaLA is a neural-network surrogate that provides uncertainty-aware predictions for inverse problems in partial differential equations. It achieves accuracy comparable to Gaussian-process surrogates while maintaining efficiency in high-dimensional parameter spaces and incorporating differential-equation constraints.

arXiv cs.LG · 6d ago

Mechanistic Study of Representation Retention in Continual Learning

A synthetic framework reveals that superposition increases over time with transient dips at task boundaries, indicating boundary-specific interference. Higher feature sparsity promotes superposition without inevitable forgetting, provided representation strength is maintained. Task-level effective rank grows with sparsity, showing broader capacity usage under sparse conditions.

arXiv cs.LG · 6d ago

Two-Stage Evolutionary Hyperparameter Optimization for PINNs

A two-stage evolutionary strategy improves Physics-Informed Neural Network performance by first screening hyperparameter candidates via low-fidelity training, then refining top candidates with gradient-based optimization. The approach reduces mean error significantly across Advection, Klein-Gordon, and Helmholtz equation problems under fixed computational budgets.

arXiv cs.LG · 6d ago
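
The screen-then-refine pattern in the PINN summary can be sketched on a toy problem. In the Python below, a single scalar stands in for a hyperparameter configuration, `low_fidelity` for a short training run, and `high_fidelity` for the full objective; both functions, the population sizes, and the finite-difference gradient step are hypothetical stand-ins chosen for this example and are not taken from the paper.

```python
import random


def low_fidelity(x):
    """Cheap, slightly biased proxy for the true error (hypothetical)."""
    return (x - 0.3) ** 2 + 0.05 * abs(x)


def high_fidelity(x):
    """Expensive objective (hypothetical); its minimum is at x = 0.25."""
    return (x - 0.25) ** 2


def screen(rng, pop_size=20, generations=5, keep=4, sigma=0.1):
    """Stage 1: evolve candidates using only the cheap proxy."""
    pop = [rng.uniform(-1.0, 1.0) for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(pop, key=low_fidelity)[:keep]
        children = [rng.choice(elite) + rng.gauss(0.0, sigma)
                    for _ in range(pop_size - keep)]
        pop = elite + children
    return sorted(pop, key=low_fidelity)[:keep]


def refine(x, steps=200, lr=0.1, eps=1e-5):
    """Stage 2: gradient descent on the expensive objective (central differences)."""
    for _ in range(steps):
        grad = (high_fidelity(x + eps) - high_fidelity(x - eps)) / (2 * eps)
        x -= lr * grad
    return x


def two_stage(seed=0):
    rng = random.Random(seed)
    finalists = screen(rng)
    return min((refine(x) for x in finalists), key=high_fidelity)
```

The point of the split is budget allocation: the proxy is evaluated many times, while the expensive objective is touched only for the few finalists.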

Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

This paper introduces Marginal Advantage Accumulation (MAA), a post-processing architecture that addresses cross-batch inconsistency in memory-driven agent self-evolution. MAA formalizes alignment and comparability as structural conditions, uses differential signals and an exponential moving average to accumulate signed evidence per operation, and ensures traceability via semantic identity merging. It outperforms batch-level baselines in 14 out of 16 settings and reduces token consumption by about 75%.

arXiv cs.LG · 6d ago
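
The summary does not give MAA's update rule, so the Python below only illustrates the generic idea it names: keep one running score per operation and update it with an exponential moving average of a signed difference. The class name, the `decay` value, and the choice of "score with the operation minus score without it" as the differential signal are all assumptions of this sketch.

```python
from collections import defaultdict


class SignedEvidence:
    """Per-operation exponential moving average of a signed differential signal."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.score = defaultdict(float)  # operation id -> accumulated evidence

    def update(self, op_id, with_op, without_op):
        """Fold one batch's evidence into the running score for op_id."""
        delta = with_op - without_op  # positive if the operation helped
        self.score[op_id] = self.decay * self.score[op_id] + (1.0 - self.decay) * delta
        return self.score[op_id]


tracker = SignedEvidence(decay=0.5)
tracker.update("merge_notes", with_op=1.0, without_op=0.0)  # helped in this batch
tracker.update("merge_notes", with_op=0.0, without_op=1.0)  # hurt in the next one
```

Because the evidence is signed and decayed, a single noisy batch cannot flip an operation's standing on its own, which is one way to read the "cross-batch inconsistency" the summary mentions.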

Entropy Estimation in Multi-Qutrit Systems with Neural Networks

A study compares variational quantum algorithms and classical CNNs for von Neumann entropy estimation in multi-qutrit systems. CNNs achieve accurate, stable predictions with only 12.5% of full state tomography measurements, reaching 90th-percentile errors of 0.13-0.16 nats for four- and five-qutrit systems, showing systematic improvement with system size and robustness to noise.

arXiv cs.LG · 6d ago
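
For scale, the quantity being estimated and its range can be written out. This is the standard definition of von Neumann entropy, not a formula taken from the paper; the symbols ρ (density matrix), λᵢ (its eigenvalues), and n (number of qutrits) are notation introduced here.

```latex
S(\rho) = -\operatorname{Tr}\left(\rho \ln \rho\right) = -\sum_i \lambda_i \ln \lambda_i ,
\qquad
0 \le S(\rho) \le \ln 3^{n} = n \ln 3 .
% n = 4: 4 \ln 3 \approx 4.39 nats;  n = 5: 5 \ln 3 \approx 5.49 nats
```

Against these maxima of roughly 4.39 and 5.49 nats, the reported 90th-percentile errors of 0.13-0.16 nats are a few percent of the full range.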

Execution-State Capsules for Low-Latency On-Device AI Serving

Execution-state capsules enable graph-bound checkpointing and restoration of complete execution state, including KV, recurrent, and convolution states, for low-latency, small-batch on-device AI serving. On RTX 5090 and Jetson AGX Thor, capsule restore achieves byte-exact and token-identical correctness, with sub-millisecond GPU operations and TTFT speedups up to 27x at 16k tokens, demonstrating significant latency reduction in interactive AI workflows.

arXiv cs.LG · 6d ago

Multi-Task Bayesian In-Context Learning Framework

A new multi-task in-context learning framework enables amortized hierarchical Bayesian inference by representing prior information as a prefix in datasets. The transformer model adapts predictions across prior families, matching oracle performance on diverse tasks while being significantly faster. It is validated on real-world spatiotemporal temperature prediction.

arXiv cs.LG · 6d ago

Lie-Algebra Attention: Group Element Tokens in Neural Networks

Lie-Algebra Attention introduces attention tokens as matrix Lie group elements, using the closed-form algebra norm of relative poses as attention scores. This method achieves invariant, equivariant attention without representation-theoretic components, outperforming vector-token baselines on SE(2), SO(3), and Aff(2) with fewer parameters and no learned kernels.

arXiv cs.AI · 6d ago
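
One way to picture "the algebra norm of relative poses as attention scores" is the SO(3) case, where the norm of log(RᵢᵀRⱼ) is the rotation angle between two poses. The Python sketch below uses the negative angle as the score and a softmax over keys; the sign convention, the softmax, and all function names are assumptions of this example, not details from the paper. Since (GRᵢ)ᵀ(GRⱼ) = RᵢᵀRⱼ, the weights do not change when every pose is rotated by the same G.

```python
import math


def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def relative_angle(r_i, r_j):
    """Rotation angle of R_i^T R_j, i.e. the norm of its axis-angle logarithm."""
    rel = matmul(transpose(r_i), r_j)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def attention_weights(query_pose, key_poses):
    """Softmax over negative relative angles: nearby poses get more weight."""
    scores = [-relative_angle(query_pose, k) for k in key_poses]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

No learned kernel appears anywhere in the score, which matches the summary's claim that the method works without learned kernels; value aggregation and the SE(2) and Aff(2) cases are left out.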

Lean as Process-Verified Reward Oracle in RL for Theorem Proving

This work shows that Lean can serve as a symbolic process oracle, providing fine-grained, verified feedback during reinforcement learning. By parsing proof attempts into tactic sequences and using Lean's elaboration to mark sound steps and first failures, the system generates dense, type-theoretic reward signals. Experiments demonstrate tactic-level supervision outperforms outcome-only methods on benchmarks like MiniF2F and ProofNet, highlighting Lean's role as both evaluator and training reward source.

arXiv cs.AI · 6d ago
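
The "mark sound steps and the first failure" idea can be sketched as a mapping from per-tactic check results to per-tactic rewards. The Python below does not call Lean; it assumes a list of booleans, one per tactic, that some checker has already produced, and the reward values and function name are hypothetical.

```python
def tactic_rewards(step_ok, sound_reward=1.0, failure_penalty=-1.0):
    """Turn per-tactic check results into a dense reward sequence.

    step_ok: list of bools in proof order (assumed to come from a proof checker).
    Sound tactics before the first failure earn sound_reward, the first failing
    tactic earns failure_penalty, and everything after it earns 0.0 because it
    was never validated.
    """
    rewards = []
    for ok in step_ok:
        if ok:
            rewards.append(sound_reward)
        else:
            rewards.append(failure_penalty)
            break
    rewards.extend([0.0] * (len(step_ok) - len(rewards)))
    return rewards


def outcome_reward(step_ok):
    """Outcome-only baseline: a single reward for a fully checked proof."""
    return 1.0 if step_ok and all(step_ok) else 0.0
```

For a four-tactic attempt that fails at the third step, `tactic_rewards([True, True, False, True])` gives `[1.0, 1.0, -1.0, 0.0]`, while `outcome_reward` returns `0.0` and carries no information about where the proof broke.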

Learnable Global Merging for Variable-Length Tokenization in Diffusion Transformers

A novel variable-length tokenizer uses learnable global merging to enable cross-length representation alignment in diffusion models. This data-independent approach overcomes position-dependent semantics and improves the quality-compute trade-off on ImageNet 256×256 generation compared to prior methods.

arXiv cs.AI · 6d ago

Residual-Space Evolutionary Optimization via Flow-based Generative Models

A model-agnostic framework combines flow-based generative editing with evolutionary algorithms to enable data editing in non-differentiable settings. It operates in residual space, using self-pollination for local refinement and cross-pollination for broad exploration, validated on MorphoMNIST and crystal data to balance target alignment, instance preservation, and diversity.