All articles
arxiv arXiv cs.AI · 4h ago

G$^3$VLA: Geometric inductive bias for Vision-Language-Action Models

The authors propose G$^3$VLA, a camera-aware geometric module that injects calibrated structure into the visual-token stream of pretrained Vision-Language-Action models without altering their action space or imitation objective. This approach combines intrinsic-conditioned ray embeddings, projective positional encoding, and bidirectional cross-view fusion to address the mismatch between 2D image coordinates and robot camera geometry.

arxiv arXiv cs.AI · 4h ago

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

CrossPool is a serving engine designed for cold Mixture-of-Experts (MoE) models that disaggregates FFN weights and KV-cache into separate GPU memory pools to address memory inefficiencies in sparse request scenarios. By consolidating static weights and dynamically provisioning active KV-cache demand, the system aims to improve GPU memory utilization and support bursty long-context requests.

arxiv arXiv cs.AI · 5h ago

Reinforcement Learning for Computer-Use Agents with Autonomous Evaluation

The authors propose a reinforcement learning fine-tuning framework that utilizes autonomous vision-language evaluation as a scalable supervision signal for GUI agents, eliminating the need for manual labels or task-specific heuristics. By treating evaluator feedback as a noisy binary reward channel and deriving a noise-corrected estimator for Proximal Policy Optimization, the method addresses the difficulty of obtaining machine-readable rewards in open-ended desktop environments.