The authors introduce REAR, a framework that extends test-time scaling (TTS) to preference alignment by modeling the task as a realignment problem. This addresses a limitation of existing TTS methods, which are typically restricted to verifiable domains such as mathematics and coding.

  • REAR decomposes the reward function into two components: a question-related term and a preference-related term.
  • From this decomposition, the method derives a REAlignment Reward (REAR) that selectively rescales the relative weight of the two terms.
  • REAR is formulated as a linear combination of token-level policy log-probabilities, ensuring computational efficiency.
  • It integrates easily with various TTS algorithms, including best-of-N sampling and tree search.
  • In the reported experiments, REAR scales across diverse user requirements and generalizes to mathematical and visual tasks.
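
The reward decomposition and its use in best-of-N sampling can be illustrated with a minimal sketch. This is not the paper's formula: it assumes the question-related term is the summed token log-probability of a response given the question alone, and the preference-related term is the extra log-probability gained when the preference is also in the prompt. The function names, the dictionary keys, and the weights `question_weight` and `preference_weight` are hypothetical, chosen only to show how a linear combination of token-level log-probabilities can rank candidates.

```python
from typing import Dict, List, Sequence


def realignment_reward(
    logp_with_pref: Sequence[float],
    logp_question_only: Sequence[float],
    question_weight: float = 1.0,
    preference_weight: float = 2.0,
) -> float:
    """Illustrative reward (an assumption, not the authors' exact definition).

    logp_with_pref[t]     : log-prob of token t given question + preference
    logp_question_only[t] : log-prob of token t given the question alone
    """
    if len(logp_with_pref) != len(logp_question_only):
        raise ValueError("token log-probability sequences must have equal length")
    # Assumed question-related term: how likely the response is for the question.
    question_term = sum(logp_question_only)
    # Assumed preference-related term: gain from conditioning on the preference.
    preference_term = sum(p - q for p, q in zip(logp_with_pref, logp_question_only))
    # Rescaling the two terms is still linear in the token-level log-probs.
    return question_weight * question_term + preference_weight * preference_term


def best_of_n(candidates: List[Dict[str, Sequence[float]]], **weights: float) -> Dict[str, Sequence[float]]:
    """Return the sampled candidate with the highest illustrative reward."""
    return max(
        candidates,
        key=lambda c: realignment_reward(
            c["logp_with_pref"], c["logp_question_only"], **weights
        ),
    )


# Two hypothetical sampled responses with made-up token log-probabilities.
candidates = [
    {"name": "A", "logp_with_pref": [-0.2, -0.3], "logp_question_only": [-0.1, -0.1]},
    {"name": "B", "logp_with_pref": [-0.4, -0.4], "logp_question_only": [-1.0, -1.2]},
]
chosen = best_of_n(candidates, question_weight=1.0, preference_weight=2.0)
```

With these made-up numbers, up-weighting the preference term selects candidate B, while setting `preference_weight=0.0` selects candidate A. Under the stated assumptions, this is the sense in which rescaling the two terms changes which response a TTS procedure keeps.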

Overall, the framework enables scalable test-time realignment under diverse user requirements, without costly data curation or additional training.