Tencent-Hunyuan has introduced GEAR (Guided End-to-End AutoRegression), a method that jointly trains a vector-quantized tokenizer and an autoregressive (AR) generator to improve image synthesis. Unlike traditional two-stage approaches, which train the tokenizer first and then freeze it before training the generator, GEAR uses representation alignment to let the AR model guide the tokenizer during training.

  • The method resolves gradient flow issues by using a dual read-out of codebook assignments, combining hard next-token prediction with a differentiable soft branch for alignment.
  • This approach shifts the alignment burden to the AR model: its features become more DINOv2-like, while the tokenizer's features become less DINOv2-like.
  • GEAR achieves up to 10x faster ImageNet gFID convergence compared to the LlamaGen-REPA baseline and learns better patch-level features.
  • The technique is generalizable across VQVAE, LFQ, and IBQ quantizers and supports text-to-image generation.
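
To make the dual read-out and the alignment objective mentioned in the list above more concrete, here is a minimal sketch in plain Python. It is not the GEAR implementation: the summary does not say how the soft branch is computed, so the softmax over negative squared distances, the `temperature` parameter, the cosine-based alignment loss, and every function name below are assumptions made for illustration. The sketch shows only the forward computation; in practice an autodiff framework would propagate gradients through the soft branch, while the hard index supplies the discrete next-token target.

```python
import math

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dual_readout(latent, codebook, temperature=1.0):
    """Read one latent vector out of the codebook in two ways (hypothetical helper).

    Hard branch: index of the nearest code, usable as a next-token target.
    Soft branch: softmax weights over negative distances (an assumption),
    plus the weighted mix of codes, which is smooth in the latent.
    """
    dists = [squared_distance(latent, code) for code in codebook]
    hard_index = min(range(len(codebook)), key=dists.__getitem__)
    soft_weights = softmax([-d / temperature for d in dists])
    soft_embedding = [
        sum(w * code[i] for w, code in zip(soft_weights, codebook))
        for i in range(len(latent))
    ]
    return hard_index, soft_weights, soft_embedding

def cosine_alignment_loss(features, targets, eps=1e-12):
    """1 - cosine similarity between model features and target features.

    Assumed stand-in for a representation-alignment loss against features
    from a pretrained encoder such as DINOv2.
    """
    dot = sum(f * t for f, t in zip(features, targets))
    norm_f = math.sqrt(sum(f * f for f in features))
    norm_t = math.sqrt(sum(t * t for t in targets))
    return 1.0 - dot / (norm_f * norm_t + eps)

# Toy example: a 3-entry codebook in 2 dimensions.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latent = [0.9, 0.1]
hard_index, soft_weights, soft_embedding = dual_readout(latent, codebook, temperature=0.1)
loss = cosine_alignment_loss(soft_embedding, [1.0, 0.0])
print(hard_index, soft_weights, soft_embedding, loss)
```

A lower `temperature` makes the soft weights concentrate on the nearest code, so the soft branch approaches the hard assignment; a higher value spreads weight across more codes.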

The authors consider this important because it enables faster training convergence and better feature learning by allowing the generator to directly influence the tokenizer's representation.