NVIDIA — korshunov.ai

Zero-Overhead Telemetry Detects Hidden ML Training

A study evaluates GPU workload classification using only zero-overhead NVML telemetry. The classifier identifies training workloads with 98.2% accuracy; against adversarially disguised, unexpected workloads, accuracy ranges from 43% to 87% across 9 GPU models.

arXiv cs.LG · 9d ago
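
To make the telemetry idea above concrete, here is a minimal sketch of polling NVML counters and turning them into summary features. It assumes the `pynvml` bindings are installed. The choice of counters and the mean/spread features are illustrative assumptions, not the study's actual feature set or classifier, and the function names `sample_gpu` and `summarize` are hypothetical.

```python
import statistics


def sample_gpu(index=0):
    """Read one telemetry sample from NVML (requires pynvml and an NVIDIA GPU)."""
    import pynvml  # imported lazily so summarize() works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # milliwatts
        return {
            "gpu_util": util.gpu,              # percent
            "mem_util": util.memory,           # percent
            "mem_used_frac": mem.used / mem.total,
            "power_w": power_mw / 1000.0,
        }
    finally:
        pynvml.nvmlShutdown()


def summarize(samples):
    """Collapse a window of samples into per-counter mean and spread.

    Assumption: these features are a stand-in for whatever the study feeds
    its classifier; they are not taken from the paper.
    """
    features = {}
    for key in samples[0]:
        values = [s[key] for s in samples]
        features[key + "_mean"] = statistics.fmean(values)
        features[key + "_std"] = statistics.pstdev(values)
    return features
```

A window of samples collected with `sample_gpu` at a fixed interval could be passed to `summarize`, and the resulting feature dictionary handed to any off-the-shelf classifier.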

Geometric Action Model for Robot Policy Learning

The Geometric Action Model (GAM) enables robot policies to reason about 3D physical interactions by repurposing a pretrained geometric foundation model (GFM). GAM splits the GFM so that it serves as both an observation encoder and a causal future predictor, then routes predicted future geometry and actions through the same backbone. It reports accurate, robust, and efficient manipulation performance on simulation and real-robot benchmarks.

arXiv cs.LG · 7d ago

FoMoE Breaks Full-Replica Barrier with Partitioned Expert Layers

FoMoE introduces a system that partitions expert layers across workers to avoid full model replicas, reducing communication costs by up to 1.42x over efficient baselines and 45.44x over DDP. It achieves up to 1.4x throughput speedups via a skip-token mechanism and demonstrates stable routing. System modeling projects that the benefits extend to 100B-scale models.

arXiv cs.AI · 7d ago

Flash Endurance as Depreciating Capital in Robot Memory

A robot's flash memory degrades with each write, making its endurance a non-renewable asset. A wear-aware pricing model uses a shadow price $η$ to guide memory placement across RAM, NVM, and cloud, with optimal routing depending on whether task value increases with memory persistence. The sign of the value-write association $χ$ varies by deployment: positive in long-horizon manipulation, null in short-horizon tasks, and negative in teleoperation. The endurance budget is binding only on low-end QLC/eMMC memory. Wear-aware routing aligns with task value, but actual performance improvements remain unverified by the data.

arXiv cs.LG · 8d ago
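
As a rough illustration of how a shadow price on flash wear could steer placement, the sketch below scores each tier by value minus wear cost minus access cost and picks the best. The linear value model (`chi * persistence`), the tier parameters, and the names `Tier`, `TIERS`, and `route` are all hypothetical assumptions made for this example; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    persistence: float   # 0 = volatile, 1 = durable (hypothetical scale)
    flash_writes: float  # flash bytes written per stored byte
    access_cost: float   # latency/bandwidth cost per byte (hypothetical units)


# Illustrative numbers only: RAM is free but volatile, NVM is durable but
# wears out, cloud is durable without local wear but costly to reach.
TIERS = (
    Tier("RAM", persistence=0.0, flash_writes=0.0, access_cost=0.0),
    Tier("NVM", persistence=1.0, flash_writes=1.0, access_cost=0.1),
    Tier("cloud", persistence=1.0, flash_writes=0.0, access_cost=0.5),
)


def route(size, chi, eta, tiers=TIERS):
    """Pick the tier with the highest net value for an item of `size` bytes.

    chi: assumed value gained per byte per unit of persistence.
    eta: shadow price charged per flash byte written.
    """
    def net(tier):
        value = chi * tier.persistence * size
        wear = eta * tier.flash_writes * size
        return value - wear - tier.access_cost * size

    return max(tiers, key=net).name
```

Under these toy numbers, a positive `chi` with a low `eta` favors NVM, raising `eta` pushes durable data to the cloud, and a null or negative `chi` keeps data in RAM, which mirrors the qualitative dependence on the sign of $χ$ described above.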

MGUP: Momentum-Gradient Alignment for Selective Optimization

MGUP introduces a selective update mechanism that applies larger step-sizes to a fixed proportion of parameters in stochastic optimization, while using smaller, non-zero step-sizes for the rest. It integrates seamlessly with optimizers like AdamW, Lion, and Muon, providing theoretical convergence guarantees for MGUP-AdamW and demonstrating superior or more stable performance in training large language models and MAE pretraining tasks.

arXiv cs.LG · 8d ago
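
The selective-update idea can be sketched in a few lines. This example uses plain SGD with momentum rather than AdamW, Lion, or Muon, and it assumes that parameters are ranked by the per-coordinate product of momentum and gradient. That ranking rule is inferred from the paper's title and is an assumption, as are the function name `selective_step` and all hyperparameter values.

```python
def selective_step(params, grads, momentum, lr=0.1, beta=0.9,
                   top_frac=0.5, small_scale=0.1):
    """One selective update on flat lists of floats.

    A fixed fraction `top_frac` of coordinates gets the full step `lr`;
    the rest get the smaller, non-zero step `lr * small_scale`.
    """
    # Exponential moving average of gradients.
    new_m = [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grads)]

    # Assumed alignment score: positive when momentum and gradient agree.
    scores = [m * g for m, g in zip(new_m, grads)]

    # Select a fixed proportion of coordinates with the highest scores.
    k = max(1, int(round(top_frac * len(params))))
    order = sorted(range(len(params)), key=lambda i: scores[i], reverse=True)
    selected = set(order[:k])

    new_p = []
    for i, (p, m) in enumerate(zip(params, new_m)):
        step = lr if i in selected else lr * small_scale
        new_p.append(p - step * m)
    return new_p, new_m, selected
```

Keeping the smaller step non-zero means unselected coordinates still move every iteration, only more slowly, which matches the description of the mechanism above.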

AoiZora: Topology-Aware Auto-Parallel Optimization for Video Diffusion Inference

AoiZora is a compiler-mediated topology planner that improves low-latency video diffusion inference on TPU sub-slices. By aligning logical sharding with physical placement through the compilation flow, it reduces one-step denoising latency by up to 1.42x on TPU v5e sub-slices compared to existing methods.

arXiv cs.AI · 8d ago

Flash Endurance as Depreciating Capital in Robot Memory

A robot's flash memory endurance is a non-renewable asset that degrades with each write. A wear-aware pricing model introduces a shadow price $η$ to guide memory placement across RAM, NVM, and cloud, with optimal routing depending on the value-write association $χ$. Empirical measurements show that $χ$ is positive in long-horizon manipulation, null in short-horizon tasks, and negative in teleoperation. The endurance budget is binding only on low-end QLC/eMMC memory, where wear-aware control changes routing according to task value without improving performance.

arXiv cs.LG · 9d ago

TokenPilot: Cache-Efficient Context Management for LLM Agents

TokenPilot reduces inference costs by 61% to 87% in both isolated and continuous modes, outperforming prior systems in cost efficiency while maintaining competitive performance. It uses ingestion-aware compaction and lifecycle-aware eviction to stabilize prompt prefixes and manage context segments efficiently.
