The OPERA framework addresses the instability of applying reinforcement learning to open-ended tasks by replacing external judge models with intrinsic rewards derived from perplexity dynamics. The reward quantifies uncertainty reduction at critical reflective states, which avoids the stylistic biases and positional inconsistencies common in LLM-as-a-judge systems. During the cold-start phase, the method uses guiding words to synthesize diverse reasoning traces and perplexity-prioritized rollouts to identify logically consistent branches. This pipeline produces a dataset of 20,000 high-quality reasoning trajectories for training. Trained with OPERA, Qwen3-8B sets a new state of the art among open-source models and matches or surpasses proprietary models such as Gemini 2.5 and MiniMax-M2.5 on specific open-ended tasks. Empirical evaluations support the scalability and efficacy of this objective, perplexity-based alignment strategy.
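To make the idea of a perplexity-based intrinsic reward concrete, the following is a minimal sketch and not OPERA's actual reward, whose exact form is not given here. It uses the standard definition of perplexity (the exponential of the negative mean token log-probability) and assumes, purely for illustration, that the same target continuation is scored once before and once after a reflective step. The function names `perplexity` and `perplexity_drop_reward`, and the choice of a relative drop as the reward, are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity: exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_drop_reward(
    logprobs_before: Sequence[float],
    logprobs_after: Sequence[float],
) -> float:
    """Hypothetical intrinsic reward: relative perplexity reduction.

    Assumption (not from the source): both sequences score the same
    continuation, conditioned on the context before and after a
    reflective step. The result is positive when uncertainty fell
    and negative when it rose.
    """
    ppl_before = perplexity(logprobs_before)
    ppl_after = perplexity(logprobs_after)
    return (ppl_before - ppl_after) / ppl_before


if __name__ == "__main__":
    # Toy numbers: each token had probability 0.25 before reflection
    # and 0.5 after it, so perplexity falls from about 4 to about 2.
    before = [math.log(0.25)] * 4
    after = [math.log(0.5)] * 4
    print(f"reward = {perplexity_drop_reward(before, after):.3f}")
```

Because the signal depends only on the policy model's own token probabilities, no external judge is queried, which is the property the abstract attributes to OPERA's reward design.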
OPERA: Aligning Open-Ended Reasoning via Objective Perplexity-based Reinforcement Learning