Topic · AI agents
arXiv cs.CL · 7d ago

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric approach improves long-context reasoning in large language models, using eight curated datasets with 14K examples across retrieval, multi-evidence synthesis, and reasoning tasks. Paired with minimal outcome-based GRPO training, it yields average gains of +6.4 to +7.2 points on seven benchmarks, outperforming prior RL training sets, and improves agentic performance by +4.8 and +7.0 points on GAIA and BrowseComp, respectively.

arXiv cs.CL · 7d ago

SenFlow: Advanced AI-Generated Text Detection in Hybrid Documents

SenFlow detects AI-generated text in hybrid documents by modeling inter-sentence dependencies. It achieves state-of-the-art performance on MOSAIC, a benchmark of 16,000 documents from PubMed and XSum, with a +4.15 pp Macro-F1 gain on cross-domain transfer. The results also show that AI-generated content still exhibits generator-dependent sentence-length patterns, which sentence-level detectors can exploit despite perplexity filtering.

arXiv cs.CL · 7d ago

GraphPO: Graph-based Policy Optimization for Reasoning Models

GraphPO introduces a directed acyclic graph framework to represent reasoning rollouts, merging semantically equivalent paths to reduce redundant exploration. It assigns efficiency and correctness advantages to edges, improving inference efficiency and process supervision while reducing advantage-estimation variance. Experiments show GraphPO outperforms chain- and tree-based methods on three LLMs across reasoning and agentic search tasks under identical token or response budgets.

arXiv cs.AI · 7d ago

WorldLines: Benchmarking Long-Horizon Embodied Agent Memory

WorldLines is a project-driven benchmark for long-horizon embodied household assistance that captures extended household traces with dialogues, actions, and state changes. It provides evidence-linked samples for Memory QA and Embodied Task Planning, and proposes ObsMem, an observer-grounded memory framework that supports visibility-aware memories and state-aware decisions. Experiments highlight challenges in partial observability and memory translation, with ObsMem providing a stronger reference architecture for such settings.

arXiv cs.AI · 7d ago

AdsMind: Physics-Grounded Multi-Agent System for Adsorption Discovery

AdsMind is a closed-loop multi-agent system that uses machine learning force fields and feedback to correct errors in adsorption configuration searches on catalyst surfaces. It achieves success rates of 100% and 98.8% on the AA20 and OCD-GMAE62 benchmarks, respectively, reduces energy dispersion 14-fold compared to baselines, and maintains correct adsorption-energy signs in DFT validation, outperforming open-loop LLM agents.