arXiv cs.AI · 7d ago · research

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric recipe improves long-context reasoning in large language models. It uses eight curated datasets totaling 14K examples that span retrieval, multi-evidence synthesis, and reasoning tasks. Paired with minimal outcome-based GRPO training, it yields average gains of +6.4 to +7.2 points across seven benchmarks, outperforming prior RL training sets. It also improves agentic performance by +4.8 points on GAIA and +7.0 points on BrowseComp.
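For readers unfamiliar with the training signal, here is a minimal sketch of the group-relative advantage at the core of GRPO with an outcome-only reward. It is not the paper's code: the function names, the exact-match reward, and the `eps` constant are assumptions made for illustration, and the summary does not say how the authors score answers.

```python
from statistics import mean, pstdev


def outcome_reward(predicted: str, reference: str) -> float:
    """Hypothetical outcome reward: 1.0 on a normalized exact match, else 0.0.

    Assumption for illustration only; the paper's actual verifier is not
    described in this summary.
    """
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize each reward against the other samples for the same prompt.

    GRPO draws a group of responses per prompt and uses the group mean as the
    baseline, so no learned value model is needed. `eps` (an assumed constant)
    avoids division by zero when every sample gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one long-context question whose reference is "Paris".
samples = ["Paris", "Lyon", "paris ", "Marseille"]
rewards = [outcome_reward(s, "Paris") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct samples receive a positive advantage and incorrect ones a negative advantage. A group in which every sample earns the same reward produces all-zero advantages and so contributes no gradient.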

Importance 3/3 · Beats a top-lab benchmark · arXiv cs.AI · Alibaba (Qwen) · AI agents · Reasoning models · Training data

Benchmarks

Benchmark	Model	Gain (points)
SWE-bench	Qwen3-4B	+7.2
BrowseComp	Qwen3-4B	+7.0
SWE-bench	Qwen3-30B-A3B	+6.4
GAIA	Qwen3-4B	+4.8
SWE-bench	Qwen3-8B	+3.2
