Results
Sort
Reset
arxiv arXiv cs.CL · 10d ago

Dynamic Rollout Editing Reduces Overthinking in RL-Trained Reasoning Models

Dynamic Rollout Editing (DRE) addresses overthinking in RL-trained reasoning models by modifying successful trajectories post-answer emergence. DRE preserves the correct reasoning prefix while editing unnecessary continuation, weakening the credit assigned to redundant thinking without penalizing valid reasoning. Experiments across diverse tasks demonstrate its effectiveness in reducing overthinking.

arxiv arXiv cs.CL · 10d ago

LOGOS: A General-Purpose Generative Model for Natural Sciences

LOGOS is a unified generative language model that represents scientific objects and their interactions as token sequences in a shared grammar. It achieves consistent or superior performance across diverse natural science tasks, demonstrating the feasibility of a single model serving multiple domains. The model scales positively with parameter count, and its design suggests that AI for Science should align deeply with large language models through shared architectures and training.

arxiv arXiv cs.CL · 10d ago

ContextRL: Context-Aware RL for LLMs

ContextRL introduces an indirect auxiliary objective to improve long-horizon reasoning and multimodal performance in LLMs. It rewards models for selecting the context that supports a query-answer pair, using contrastive context data from coding agent trajectories and image-based visual questions. ContextRL achieves +2.2% and +1.8% gains over standard methods on long-horizon and visual QA benchmarks, with gains attributed to the selection objective, not data augmentation.

arxiv arXiv cs.CL · 10d ago

Language Models Encode Value of Their Current Trajectory

Qwen3-8B internally tracks the value of its current trajectory, defined as the likelihood of achieving its goals. This 'value' axis distinguishes confidence levels, backtracking behavior, and code correctness, and shows that preference optimization boosts confidence in rewarded behaviors. The model assigns low value to politically sensitive queries post-training, and fine-tuning increases confidence within specific domains.