DiT-Reward: Generative Representations for Text-to-Image Reward Modeling

The article introduces DiT-Reward, a method that converts a pretrained text-to-image Diffusion Transformer into a reward model by processing near-clean image latents and aggregating text-conditioned representations across transformer layers. The approach reuses the representations learned during generative pretraining to score generated images, without requiring a separate training objective for the backbone.
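The layer aggregation described above can be illustrated with a small sketch. The summary does not specify how the representations are combined, so everything here is an assumption made for illustration: the per-layer features are taken to be already pooled into fixed-size vectors, the layers are mixed with softmax-normalized learnable weights, and a single linear projection maps the result to a scalar reward. The names `aggregate_layers`, `reward_head`, `layer_logits`, and `proj` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_feats, layer_logits):
    """Mix per-layer feature vectors with softmax-normalized weights.

    layer_feats:  list of L vectors (one pooled feature vector per layer).
    layer_logits: list of L floats (assumed learnable mixing logits).
    """
    if len(layer_feats) != len(layer_logits):
        raise ValueError("need one logit per layer")
    weights = softmax(layer_logits)
    dim = len(layer_feats[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_feats))
        for d in range(dim)
    ]


def reward_head(layer_feats, layer_logits, proj, bias=0.0):
    """Assumed lightweight head: layer mixing followed by a linear projection."""
    pooled = aggregate_layers(layer_feats, layer_logits)
    return sum(p * x for p, x in zip(proj, pooled)) + bias


# Toy example: three layers, two feature dimensions, equal layer weights.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = reward_head(feats, layer_logits=[0.0, 0.0, 0.0], proj=[1.0, 2.0])
print(round(score, 6))  # 2.0
```

With equal logits each layer contributes one third, so the mixed vector is (2/3, 2/3) and the projected score is 2/3 + 4/3 = 2. In a real setup only the mixing logits and projection would be trained while the backbone stays frozen.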

DiT-Reward outperforms HPSv3 on all four evaluated preference benchmarks when trained on the same data mixture, reaching 85.6% on HPDv2 and 77.6% on HPDv3.
A lightweight learned head can extract meaningful preference predictions from frozen generative backbone representations.
Downstream reward performance is strongest in middle-to-late layers and benefits from combining representations across different stages.
The method shows consistent positive scaling with generative backbone capacity.
When used as the reward signal for optimizing Stable Diffusion 3.5 Large with Flow-GRPO, DiT-Reward outperforms HPSv3, with clear gains in realism.
Direct latent scoring provides a 1.65x inference speedup over HPSv3 with comparable peak memory.
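The benchmark percentages above are preference-prediction scores. The summary does not define the metric, so the sketch below assumes the common convention of pairwise accuracy: the fraction of human-labeled image pairs in which the reward model scores the preferred image strictly higher than the rejected one. The function name `pairwise_accuracy` and the toy scores are hypothetical.

```python
def pairwise_accuracy(preferred_scores, rejected_scores):
    """Fraction of pairs where the preferred image gets the higher reward.

    Assumed metric: ties count as incorrect predictions.
    """
    if len(preferred_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not preferred_scores:
        raise ValueError("need at least one pair")
    correct = sum(
        1 for p, r in zip(preferred_scores, rejected_scores) if p > r
    )
    return correct / len(preferred_scores)


# Toy example with made-up reward scores for four image pairs.
preferred = [0.9, 0.2, 0.7, 0.5]
rejected = [0.1, 0.4, 0.3, 0.5]
print(pairwise_accuracy(preferred, rejected))  # 0.5
```

In the toy data the first and third pairs are ranked correctly, the second is ranked wrongly, and the fourth is a tie, giving 2 of 4 correct.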

These results demonstrate that pretrained generative Diffusion Transformers provide transferable representations suitable for reward modeling and policy optimization tasks.