Lab · Google DeepMind
arxiv arXiv cs.CL · 7d ago

REVES: Augmented Training for Test-Time Scaling

REVES introduces a two-stage iterative framework that enhances large language model reasoning through sequential revision and verification. It achieves +6.5 points over RL baselines and +4.0 points over standard multi-turn training on LiveCodeBench, using a 4B base model with fewer rollouts than larger systems. The method improves error correction and generalizes to out-of-distribution puzzles like n_queens and mini_sudoku.

arxiv arXiv cs.CL · 7d ago

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO introduces a directed acyclic graph framework to represent reasoning rollouts, merging semantically equivalent paths to reduce redundant exploration. It assigns efficiency and correctness advantages to edges, improving inference efficiency and process supervision while reducing advantage-estimation variance. Experiments show GraphPO outperforms chain- and tree-based methods on three LLMs across reasoning and agentic search tasks under identical token or response budgets.

arxiv arXiv cs.AI · 7d ago

WorldLines: Benchmarking Long-Horizon Embodied Agent Memory

WorldLines introduces a project-driven benchmark for long-horizon embodied household assistance, capturing extended household traces with dialogues, actions, and state changes. It enables evidence-linked samples for Memory QA and Embodied Task Planning, and proposes ObsMem, an observer-grounded memory framework that supports visibility-aware memories and state-aware decisions. Experiments highlight challenges in partial observability and memory translation, with ObsMem providing a stronger reference architecture for such settings.

arxiv arXiv cs.AI · 7d ago

AdsMind: Physics-Grounded Multi-Agent System for Adsorption Discovery

AdsMind is a closed-loop multi-agent system that uses machine learning force fields and feedback to correct errors in adsorption configuration searches on catalyst surfaces. It achieves 100% and 98.8% success rates on AA20 and OCD-GMAE62 benchmarks, reduces energy dispersion by 14-fold compared to baselines, and maintains correct adsorption-energy signs in DFT validation, outperforming open-loop LLM agents.

arxiv arXiv cs.LG · 8d ago

NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment

NoiseTilt introduces NTRK, a reward-guided diffusion sampler that injects reward gradients via the noise term without altering the reverse kernel. By using a whitening operator, NTRK safely biases noise toward high reward, preserving sample quality while maintaining strong guidance. On aesthetic generation, NTRK achieves superior reward performance with 25 NFEs, reducing compute by 20× compared to state-of-the-art baselines.

arxiv arXiv cs.LG · 8d ago

Compositional Generalization in Language Model Reasoning

A hierarchical latent selection model shows that supervised fine-tuning and reinforcement learning work together to enable compositional generalization in language models. SFT provides raw module materials, while RL identifies and recombines atomic modules from compound traces to solve new problems. Training on compound traces leads to stronger generalization than isolated module training, and an effective protocol is found where SFT ensures module coverage and RL drives exploration of novel compositions.

arxiv arXiv cs.CL · 8d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

ProvenanceGuard introduces a source-aware verifier for MCP-based LLM agents that detects cross-source conflation by routing claims to specific evidence sources and comparing stated attribution with actual source ownership. It achieves block F1 of 0.802 and source accuracy of 0.858 on 260 source-eligible claims, outperforming source-blind baselines, and detects all injected attribution swaps in 50 clinical probes.

arxiv arXiv cs.CL · 8d ago

RubricsTree: Scalable Evaluation Framework for Personal Health Agents

RubricsTree introduces a hierarchical taxonomy of over 100 clinically-verifiable Boolean rubrics, evolved from 4,000 real user queries via human-in-the-loop curation. It enables scalable, expert-aligned evaluation of personal health agents by dynamically routing queries to relevant rubrics and outperforms baseline methods in alignment, context sensitivity, and model performance gains of up to 66% on HealthBench.