A novel variable-length tokenizer uses learnable global merging to enable cross-length representation alignment in diffusion models. This data-independent approach overcomes position-dependent semantics and improves the quality-compute trade-off on ImageNet 256×256 generation compared to prior methods.
Learnable Global Merging for Variable-Length Tokenization in Diffusion Transformers