Lab · Microsoft Research
arXiv cs.AI · 6d ago

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant enables 4-bit KV caching for context-heavy agents, reducing P50 time-to-first-token by 3.47x in late rounds and boosting output throughput by 1.63x over an FP8 KV baseline. It combines FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA on AMD CDNA4 GPUs, with optimized decode-attention kernels and design choices for robustness such as asymmetric K/V treatment and Walsh-Hadamard rotation.
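
For readers unfamiliar with the ingredients named above, the following is a minimal sketch of two of them: quantizing a group of values to 4-bit floating point with a shared power-of-two scale (the kind of scale a UE8M0 exponent-only byte can encode), and a normalized Walsh-Hadamard rotation that spreads an outlier across a vector before quantization. It is not UltraQuant's implementation. The function names, the scale-selection rule, the nearest-value rounding, and the choice of the E2M1 value grid for FP4 are all assumptions made for illustration; the summary does not say which tensors are rotated or how groups are sized.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign stored separately).
# Assumption: the FP4 format referred to above uses this grid.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def pow2_scale(group):
    """Pick a power-of-two scale so the largest magnitude maps to at most 6.0.

    Hypothetical rule for illustration; a real kernel may choose differently.
    """
    amax = max(abs(x) for x in group)
    if amax == 0.0:
        return 1.0
    exponent = math.ceil(math.log2(amax / 6.0))
    return 2.0 ** exponent


def quantize_group(group):
    """Return (scale, values) with each value snapped to the E2M1 grid."""
    scale = pow2_scale(group)
    snapped = []
    for x in group:
        # Nearest grid magnitude; ties resolve to the smaller magnitude.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        snapped.append(math.copysign(mag, x))
    return scale, snapped


def dequantize_group(scale, snapped):
    """Reconstruct approximate values from the shared scale and grid values."""
    return [v * scale for v in snapped]


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) normalization the transform is its own inverse.
    """
    n = len(vec)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [x * norm for x in out]


if __name__ == "__main__":
    # A single outlier is spread evenly across the rotated vector.
    print(fwht([8.0, 0.0, 0.0, 0.0]))  # -> [4.0, 4.0, 4.0, 4.0]
    scale, snapped = quantize_group([12.0, 1.0, -3.0, 0.0])
    print(scale, snapped, dequantize_group(scale, snapped))
```

The rotation matters because a group shares one scale: a single large value forces a coarse scale on every other element in the group, whereas after rotation the magnitudes are more uniform and the 4-bit grid is used more evenly.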

arXiv cs.LG · 6d ago

Probe-and-Refine Tuning Improves Coding Agent Performance

A new method called probe-and-refine tuning uses synthetic bug-fix probes to iteratively improve repository guidance files with single-shot LLM calls, without agent loops or tool use. On SWE-bench Verified, it achieves a 33.0% mean resolve rate, 14.5 percentage points higher than the initial static knowledge base, a gain that reflects improved coverage rather than patch precision. The method enables agents to use larger step budgets effectively, and performance remains stable across models when diagnostic output is sufficient.

arXiv cs.LG · 6d ago

Execution-State Capsules for Low-Latency On-Device AI Serving

Execution-state capsules enable graph-bound checkpointing and restoration of complete execution state, including KV, recurrent, and convolution states, for low-latency, small-batch on-device AI serving. On RTX 5090 and Jetson AGX Thor, capsule restore achieves byte-exact and token-identical correctness, with sub-millisecond GPU operations and TTFT speedups up to 27x at 16k tokens, demonstrating significant latency reduction in interactive AI workflows.

arXiv cs.AI · 6d ago

CRAX: Fast Safe Reinforcement Learning Benchmarking

CRAX introduces a high-fidelity, accelerated safety benchmark for reinforcement learning using MuJoCo XLA. It achieves up to 100x speedups over CPU-based benchmarks via vectorization and hardware acceleration, featuring six environment suites and three agent-specific tasks across three difficulty levels. Evaluation of six safe RL methods shows no single approach dominates, highlighting trade-offs between performance and safety, with curriculum learning and safety transfer improving results.

arXiv cs.AI · 7d ago

User as Engram: Local Parametric Edits for Personal Memory

User as Engram proposes storing per-user facts as surgical, hash-keyed edits to a memory table, leaving reasoning in a shared adapter. This design achieves 5.6x higher indirect-reasoning accuracy and maintains base-level reasoning performance, with a memory footprint 33,000x smaller than per-user LoRA. The approach enables disjoint user edits that compose losslessly, outperforming retrieval pipelines beyond 100 facts.

arXiv cs.AI · 7d ago

ScenA: Reference-Driven Multi-Speaker Audio Scene Generation

ScenA conditions a text-to-audio foundation model on multiple reference voices and a natural language scene prompt to generate realistic multi-speaker conversations. It addresses the 'Reference Shortcut' issue by using a high-noise-biased training schedule, ensuring speaker assignment relies on text prompts rather than acoustic similarity. Evaluated on CoVoMix2-Dialogue, ScenA outperforms existing systems in speaker-binding and produces rich, naturalistic audio with overlapping speech and ambient noise.

arXiv cs.AI · 7d ago

WorldLines: Benchmarking Long-Horizon Embodied Agent Memory

WorldLines introduces a project-driven benchmark for long-horizon embodied household assistance, capturing extended household traces with dialogues, actions, and state changes. It enables evidence-linked samples for Memory QA and Embodied Task Planning, and proposes ObsMem, an observer-grounded memory framework that supports visibility-aware memories and state-aware decisions. Experiments highlight challenges in partial observability and memory translation, with ObsMem providing a stronger reference architecture for such settings.