The authors introduce REAR, a novel framework that extends test-time scaling (TTS) to preference alignment by modeling the task as a realignment problem. This approach addresses the limitation of existing TTS methods, which are typically restricted to verifiable domains like mathematics and coding.
- REAR decomposes the reward function into two components: one related to the question and another to preference information.
- The method derives a REAlignment Reward (REAR) that selectively rescales the relative weight of these two reward terms.
- REAR is formulated as a linear combination of token-level policy log-probabilities, ensuring computational efficiency.
- It integrates easily with various TTS algorithms, including best-of-N sampling and tree search.
- Experiments show that REAR scales across diverse user requirements and generalizes to mathematical and visual tasks.
As a result, the framework enables scalable test-time realignment under diverse user requirements without costly data curation or additional training.
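To make the "linear combination of token-level policy log-probabilities" and its use in best-of-N sampling concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `realignment_score`, and `best_of_n`, the weights `w_pref` and `w_question`, and the hard-coded log-probabilities are all hypothetical. The exact coefficients REAR derives are not given in this summary. In practice the two log-probability sequences would presumably come from scoring each sampled response with the policy, once with the preference information in the prompt and once with the question alone.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    text: str
    # Per-token log-probs of the response given question + preference (toy values).
    logp_with_pref: List[float]
    # Per-token log-probs of the same response given the question only (toy values).
    logp_question_only: List[float]


def realignment_score(c: Candidate, w_pref: float, w_question: float) -> float:
    """Weighted sum of two token-level log-prob streams (weights are assumptions)."""
    if len(c.logp_with_pref) != len(c.logp_question_only):
        raise ValueError("both log-prob sequences must cover the same tokens")
    return sum(
        w_pref * lp + w_question * lq
        for lp, lq in zip(c.logp_with_pref, c.logp_question_only)
    )


def best_of_n(
    candidates: Sequence[Candidate], w_pref: float, w_question: float
) -> Candidate:
    """Best-of-N selection: return the candidate with the highest score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: realignment_score(c, w_pref, w_question))


# Two toy responses to the same prompt.
candidates = [
    Candidate("A", [-0.2, -0.3], [-0.2, -0.4]),
    Candidate("B", [-0.5, -0.5], [-1.5, -1.5]),
]

# Weights (1, 0) reduce the score to plain preference-conditioned likelihood.
plain = best_of_n(candidates, w_pref=1.0, w_question=0.0)

# Rescaling the two terms changes which candidate is selected.
realigned = best_of_n(candidates, w_pref=2.0, w_question=-1.0)
```

With these toy numbers, `plain` selects response A (the more likely response overall), while `realigned` selects response B, whose likelihood rises most when the preference is added to the prompt. Because the score only needs log-probabilities the policy already produces, the same function could serve as the value signal in a tree search instead of `best_of_n`.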