RTX 5090 MSI Power Usage and Cable Warning
The RTX 5090 MSI draws 475-500 W during inference or diffusion training. The user reports no issues with the power cable and stresses that it should not be bent, to keep operation safe and stable.
A long-context decode performance cliff on AMD Radeon AI PRO R9700 (RDNA4) was resolved by enabling AITER Unified Attention in vLLM 0.22.1. The fix involves relaxing a CDNA gate to include RDNA4, disabling other attention backends, and using bf16 KV cache, resulting in significant speedups across all context lengths. FP8 KV is ineffective on this hardware, and the model's native 262K context is fully achievable with bf16, offering ~2.9× concurrency without needing FP8.
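The trade-off between bf16 and FP8 KV cache comes down to bytes per cached token. The sketch below uses the standard sizing formula (keys plus values, per layer, per KV head). The model shape is hypothetical and not taken from the report, and 262K is read as 262,144 tokens.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to hold keys and values for `tokens` positions."""
    # The factor 2 covers one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

# Hypothetical model shape, for illustration only (not from the report).
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128
CONTEXT = 262_144  # assumption: "262K" means 262,144 tokens

bf16_bytes = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_bytes = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM, 1)
print(f"bf16 KV: {bf16_bytes / 2**30:.1f} GiB, fp8 KV: {fp8_bytes / 2**30:.1f} GiB")
```

With real layer counts and head sizes substituted, the same function shows how many full-length sequences fit in the memory left after the weights are loaded.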
EvoTensile uses evolutionary algorithms to tune GEMM kernels for AMD GPUs, improving NT layout performance from 20 to 40 TFLOPS on Strix Halo. That is a 2x speedup over the unoptimized kernels, though at roughly 67% of the theoretical roofline of 59.4 TFLOPS it still leaves headroom.
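The throughput figures above can be related to the roofline with a few lines of arithmetic. This is a minimal sketch: the function names are chosen here, and only the 20, 40, and 59.4 TFLOPS values come from the summary.

```python
ROOFLINE_TFLOPS = 59.4  # theoretical peak quoted in the summary

def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOPS for an (m x k) @ (k x n) GEMM that took `seconds`."""
    # A GEMM performs one multiply and one add per (m, n, k) triple.
    return 2 * m * n * k / seconds / 1e12

def roofline_fraction(achieved_tflops, peak_tflops=ROOFLINE_TFLOPS):
    """Fraction of the theoretical peak that a kernel reaches."""
    return achieved_tflops / peak_tflops

for label, achieved in (("untuned", 20.0), ("tuned", 40.0)):
    print(f"{label}: {roofline_fraction(achieved):.1%} of roofline")
```

Passing a measured kernel time to `gemm_tflops` gives the achieved figure to feed into `roofline_fraction`.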
A study identifies shrinkage bias in E2M1-based FP4 formats due to geometric asymmetry, causing multiplicative error accumulation and training instability. The proposed UFP4 recipe uses uniform E1M2/INT4 grids and applies Random Hadamard Transform to all GEMMs, achieving lower loss degradation than E2M1 baselines in large-scale LLM pretraining. The authors recommend E1M2/INT4 as a first-class training primitive for future accelerators.
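The Random Hadamard Transform named above can be sketched as a random sign flip followed by a normalized Hadamard matrix. This is an illustration under stated assumptions, not the UFP4 implementation: the helper names, the Sylvester construction, and the matrix shapes are all chosen here. It shows that rotating both GEMM operands with the same orthogonal matrix leaves the product unchanged, which is what allows the rotation to be applied before quantization.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def random_hadamard_transform(x, rng):
    """Rotate the last axis of x by Q = D H / sqrt(n), with D a random sign diagonal."""
    n = x.shape[-1]
    signs = rng.choice([-1.0, 1.0], size=n)
    rotation = (signs[:, None] * hadamard(n)) / np.sqrt(n)
    return x @ rotation, rotation

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 16))   # hypothetical activation tile
b = rng.standard_normal((16, 8))   # hypothetical weight tile
a_rot, q = random_hadamard_transform(a, rng)
b_rot = q.T @ b                    # apply the matching rotation to the other operand
```

Because `q` is orthogonal, `a_rot @ b_rot` equals `a @ b` up to floating-point error. In a quantized pipeline the low-precision cast would be applied to `a_rot` and `b_rot`.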
A pretrained speech classifier is repurposed as a backbone for guided diffusion-based speech generation. By attaching a lightweight subnetwork and training it under denoising score matching, the approach achieves high speech quality with reduced memory and computational cost, using a single model instead of two separately trained components.
UltraQuant enables 4-bit KV caching for context-heavy agents, reducing P50 time-to-first-token by 3.47x in late rounds and boosting output throughput by 1.63x over FP8 KV baseline. It achieves this using FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA on AMD CDNA4 GPUs, with optimizations for decode-attention kernels and robust design choices like asymmetric K/V treatment and Walsh-Hadamard rotation.
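To make the "FP4 tensors with UE8M0 group scales" idea concrete, here is a rough sketch of quantizing one group of values onto the E2M1 grid with a shared power-of-two scale. It is not UltraQuant's kernel. UE8M0 is read here as an exponent-only (power-of-two) scale, the shared-exponent rule follows the usual microscaling convention, and the function names and the simple nearest-value rounding are assumptions.

```python
import numpy as np

# Non-negative magnitudes representable in E2M1 (sign is stored separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_EMAX = 2  # exponent of the largest E2M1 value: 6.0 = 1.5 * 2**2

def quantize_group(x):
    """Quantize one group to E2M1 values with a shared power-of-two scale."""
    amax = np.max(np.abs(x))
    if amax == 0.0:
        return np.zeros_like(x), 0
    # Assumed rule: align the largest magnitude with the top E2M1 binade.
    exponent = int(np.floor(np.log2(amax))) - E2M1_EMAX
    scaled = x / 2.0 ** exponent
    # Nearest grid point per element; magnitudes above 6.0 saturate to 6.0.
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]), axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], exponent

def dequantize_group(q, exponent):
    """Recover approximate values from grid points and the shared exponent."""
    return q * 2.0 ** exponent

group = np.array([0.1, -0.4, 3.0, -6.0])
q_vals, shared_exp = quantize_group(group)
```

A real kernel would pack two 4-bit codes per byte and store the exponent in one byte per group. This sketch keeps everything in floats so the rounding behavior is easy to inspect.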
This paper examines how mixture-of-experts models maintain calibration under distribution shift. It finds that expert-level calibration ensures overall model calibration in hard-routed models but is insufficient for soft-routed models. The authors propose adversarial reweighting to penalize calibration errors in routed aggregates, improving the accuracy-calibration tradeoff across tasks and shifts.
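Calibration of the kind discussed above is commonly summarized with the expected calibration error (ECE). The summary does not say which metric the paper uses, so the sketch below is a generic equal-width-bin ECE with a hypothetical function name, not the authors' evaluation code.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Weighted mean gap between accuracy and confidence over equal-width bins."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in [0, n_bins - 1].
    bins = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return ece
```

Computing this once per expert and once on the routed output is one way to check whether expert-level calibration carries over to the aggregate prediction.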
Direct Advantage Estimation (DAE) is extended to partially observable domains with minimal modifications. A discrete latent dynamics model reduces computational overhead by efficiently approximating transition probabilities, enabling scalable and sample-efficient deep reinforcement learning in high-dimensional observation spaces.
A study shows diffusion models can achieve global minimizers without explicit timestep embeddings. Ablation studies on CelebA and CIFAR-10 reveal time-agnostic models maintain high fidelity and outperform conditioned ones in FID, precision, and recall.
DeepGaLA is a neural-network surrogate that provides uncertainty-aware predictions for inverse problems in partial differential equations. It achieves accuracy comparable to Gaussian-process surrogates while maintaining efficiency in high-dimensional parameter spaces and incorporating differential-equation constraints.
A synthetic framework reveals that superposition increases over time with transient dips at task boundaries, indicating boundary-specific interference. Higher feature sparsity promotes superposition without inevitable forgetting, provided representation strength is maintained. Task-level effective rank grows with sparsity, showing broader capacity usage under sparse conditions.
A two-stage evolutionary strategy improves Physics-Informed Neural Network performance by first screening hyperparameter candidates via low-fidelity training, then refining top candidates with gradient-based optimization. The approach reduces mean error significantly across Advection, Klein-Gordon, and Helmholtz equation problems under fixed computational budgets.
UltraQuant introduces a 4-bit KV caching method tailored for context-heavy agent workloads. It achieves 3.47x reduction in P50 time-to-first-token in late rounds and 1.63x higher output throughput compared to FP8 KV caching, using FP8 queries, FP4 KV tensors, and native AMD CDNA4 scaled-MFMA support.
This paper introduces Marginal Advantage Accumulation (MAA), a post-processing architecture that addresses cross-batch inconsistency in memory-driven agent self-evolution. MAA formalizes alignment and comparability as structural conditions, uses differential signals and exponential moving average to accumulate signed evidence per operation, and ensures traceability via semantic identity merging. It outperforms batch-level baselines in 14 out of 16 settings and reduces token consumption by about 75%.
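The accumulation step described above, an exponential moving average over signed per-operation evidence, can be sketched as follows. The operation names, the decay value, and the choice to leave unseen operations untouched are assumptions made for illustration. The summary does not give MAA's actual update rule.

```python
from collections import defaultdict

def accumulate(evidence, batch_signals, decay=0.9):
    """Fold one batch of signed signals into the running per-operation evidence."""
    for op, signal in batch_signals.items():
        # Exponential moving average: old evidence decays, new signal is blended in.
        evidence[op] = decay * evidence[op] + (1.0 - decay) * signal
    return evidence

evidence = defaultdict(float)  # operations start with zero evidence
batches = (
    {"add_note": 1.0, "delete_note": -1.0},  # hypothetical memory operations
    {"add_note": 1.0},
    {"add_note": -0.5},
)
for batch in batches:
    accumulate(evidence, batch)
```

An operation whose accumulated evidence stays positive across batches would be kept, while one that oscillates around zero gives no consistent signal either way.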
A study compares variational quantum algorithms and classical CNNs for von Neumann entropy estimation in multi-qutrit systems. CNNs achieve accurate, stable predictions with only 12.5% of full state tomography measurements, reaching 90th-percentile errors of 0.13-0.16 nats for four- and five-qutrit systems, showing systematic improvement with system size and robustness to noise.
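For reference, the quantity being estimated is the von Neumann entropy S(rho) = -Tr(rho ln rho), reported in nats. The sketch below computes it exactly from a known density matrix. It is the ground-truth calculation only, not either of the estimators compared in the study, and the function name is chosen here.

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho ln rho) in nats, computed from the eigenvalues of rho."""
    eigenvalues = np.linalg.eigvalsh(rho)
    # Drop zero and tiny negative round-off eigenvalues: 0 * ln 0 is taken as 0.
    positive = eigenvalues[eigenvalues > 1e-12]
    return float(-np.sum(positive * np.log(positive)))

maximally_mixed = np.eye(3) / 3.0                       # one qutrit, maximal entropy
pure_state = np.zeros((3, 3))
pure_state[0, 0] = 1.0                                  # one qutrit, zero entropy
two_qutrits = np.kron(maximally_mixed, maximally_mixed)  # product state of two qutrits
```

A maximally mixed qutrit has entropy ln 3, and the entropy of a product state is the sum of the parts, so `two_qutrits` comes out at 2 ln 3.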
Execution-state capsules enable graph-bound checkpointing and restoration of complete execution state, including KV, recurrent, and convolution states, for low-latency, small-batch on-device AI serving. On RTX 5090 and Jetson AGX Thor, capsule restore achieves byte-exact and token-identical correctness, with sub-millisecond GPU operations and TTFT speedups up to 27x at 16k tokens, demonstrating significant latency reduction in interactive AI workflows.
A new multi-task in-context learning framework enables amortized hierarchical Bayesian inference by representing prior information as a prefix in datasets. The transformer model adapts predictions across prior families, matching oracle performance on diverse tasks while being significantly faster. It is validated on real-world spatiotemporal temperature prediction.
Lie-Algebra Attention introduces attention tokens as matrix Lie group elements, using the closed-form algebra norm of relative poses as attention scores. This method achieves invariant, equivariant attention without representation-theoretic components, outperforming vector-token baselines on SE(2), SO(3), and Aff(2) with fewer parameters and no learned kernels.
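For SO(3), the "algebra norm of the relative pose" has a simple closed form: the rotation angle of R_i^T R_j. The sketch below turns that angle into attention weights. The negative sign, the temperature, and the softmax are assumptions made here, and the paper's exact scoring and normalization may differ.

```python
import numpy as np

def rot_z(theta):
    """Rotation by `theta` radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def relative_angle(r_i, r_j):
    """Rotation angle of r_i^T r_j, i.e. the norm of its axis-angle logarithm."""
    cos_theta = (np.trace(r_i.T @ r_j) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def attention_weights(rotations, temperature=1.0):
    """Row-wise softmax over negative relative angles (closer poses attend more)."""
    n = len(rotations)
    scores = np.array([
        [-relative_angle(rotations[i], rotations[j]) / temperature for j in range(n)]
        for i in range(n)
    ])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

tokens = [rot_z(0.0), rot_z(0.3), rot_z(1.2)]
weights = attention_weights(tokens)
```

The score depends only on R_i^T R_j, so left-multiplying every token by the same global rotation leaves the weights unchanged. That is the invariance property the summary refers to.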