The authors introduce DiT-Reward, a method that converts a pretrained text-to-image Diffusion Transformer into a reward model by aggregating text-conditioned image representations across transformer layers. Evaluated under the same training data mixture as HPSv3, DiT-Reward outperforms HPSv3 on all four preference benchmarks, achieving 85.6% on HPDv2 and 77.6% on HPDv3. The study reveals that downstream reward performance is strongest in middle-to-late layers and benefits from combining representations across different stages. Even with a frozen generative backbone, a lightweight learned head can extract meaningful preference predictions from these representations. When used to optimize Stable Diffusion 3.5 Large with Flow-GRPO, DiT-Reward surpasses HPSv3 along the matched training trajectory, showing clear gains in realism. Additionally, direct latent scoring provides a 1.65x inference speedup over HPSv3 while maintaining comparable peak memory usage. These results demonstrate that pretrained generative Diffusion Transformers provide transferable representations for reward modeling and policy optimization.
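To make the idea of aggregating representations across layers under a lightweight head more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. The softmax-weighted layer mixing, the linear head, the Bradley-Terry style pairwise loss, and all names (`aggregate_layers`, `reward`, `pairwise_loss`, `layer_logits`, `head_w`, `head_b`) are assumptions chosen for illustration. The per-layer features are assumed to be already pooled into fixed-length vectors taken from a frozen backbone.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(
    layer_feats: Sequence[Sequence[float]],
    layer_logits: Sequence[float],
) -> List[float]:
    """Mix pooled per-layer features with learned softmax weights.

    Assumption: one pooled feature vector per transformer layer, all of
    the same dimension. The weighting scheme is hypothetical.
    """
    weights = softmax(layer_logits)
    dim = len(layer_feats[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, layer_feats))
        for d in range(dim)
    ]


def reward(
    layer_feats: Sequence[Sequence[float]],
    layer_logits: Sequence[float],
    head_w: Sequence[float],
    head_b: float,
) -> float:
    """Scalar score from a linear head on the aggregated features."""
    pooled = aggregate_layers(layer_feats, layer_logits)
    return sum(w * x for w, x in zip(head_w, pooled)) + head_b


def pairwise_loss(r_preferred: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood, -log(sigmoid(r_p - r_r)).

    Assumption: a common preference objective, not necessarily the one
    used to train DiT-Reward. Computed as a stable softplus of -diff.
    """
    diff = r_preferred - r_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # Two hypothetical layers, each with a 2-dimensional pooled feature.
    feats_a = [[1.0, 2.0], [3.0, 4.0]]
    feats_b = [[0.5, 2.5], [1.0, 5.0]]
    logits = [0.0, 0.0]  # equal layer weights before any training
    w, b = [1.0, -1.0], 0.5
    r_a = reward(feats_a, logits, w, b)
    r_b = reward(feats_b, logits, w, b)
    print(r_a, r_b, pairwise_loss(r_a, r_b))
```

In this sketch only `layer_logits`, `head_w`, and `head_b` would be trained, which mirrors the frozen-backbone setting described above. Raising the logit of a middle or late layer increases that layer's share of the aggregated representation.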
DiT-Reward: Using Diffusion Transformer Representations for Text-to-Image Reward Modeling