Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

NVIDIA introduces Nemotron-TwoTower, a diffusion language model that splits context representation and iterative denoising across two separate networks to overcome capacity limitations in existing approaches. Built on the open-weight Nemotron-3-Nano-30B-A3B model and trained on 2.1T tokens, it retains 98.7% of the autoregressive baseline's quality while achieving 2.42x higher wall-clock generation throughput.

The architecture pairs a frozen autoregressive context tower, which processes clean tokens causally, with a trainable diffusion denoiser tower, which uses bidirectional block attention to refine noisy blocks.
The model is based on Nemotron-3-Nano-30B-A3B, an open-weight 30B hybrid Mamba-Transformer mixture-of-experts (MoE) model.
Training covered approximately 2.1T tokens; the resulting model retains 98.7% of the autoregressive baseline's quality at 2.42x the wall-clock generation throughput.
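
To make the two attention patterns concrete, here is a minimal sketch of the masks they imply. It is an illustration under stated assumptions, not NVIDIA's implementation: the function names, the fixed block size, and the rule that a noisy block sees only clean tokens from earlier blocks are all assumptions (the last one borrowed from common block-diffusion setups). It also covers attention layers only and says nothing about the Mamba layers of the hybrid backbone.

```python
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """Context tower (assumed): clean token i attends to clean tokens j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def denoiser_masks(seq_len: int, block_size: int):
    """Denoiser tower (assumed layout): masks for noisy query positions.

    Returns two boolean arrays of shape (seq_len, seq_len), where
    mask[i, j] == True means noisy query i may attend to key j.
      to_context   : keys are clean tokens; only earlier blocks are visible.
      within_block : keys are noisy tokens; the query's own block is visible
                     in both directions (bidirectional block attention).
    """
    block_id = np.arange(seq_len) // block_size
    to_context = block_id[:, None] > block_id[None, :]
    within_block = block_id[:, None] == block_id[None, :]
    return to_context, within_block


if __name__ == "__main__":
    to_context, within_block = denoiser_masks(seq_len=6, block_size=2)
    print(causal_mask(3).astype(int))
    print(to_context.astype(int))
    print(within_block.astype(int))
```

In this sketch the causal mask is lower-triangular, while `within_block` is block-diagonal and symmetric: every position in a noisy block can see every other position in that block, which is what lets a whole block be refined at once.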

This approach enables parallel, iterative generation while retaining most of the quality of the autoregressive baseline, offering a more efficient alternative for language modeling tasks.
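
The following toy loop sketches what "parallel and iterative" generation can look like at the block level. Everything in it is hypothetical: the `MASK` placeholder, the `Denoiser` signature, the confidence-based commit schedule, and the `toy_denoise` stand-in are assumptions chosen for illustration, and the summary above does not specify the actual sampling procedure.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical placeholder id for a position not yet denoised

# Hypothetical interface: given the clean context and the current noisy block,
# return one (token, confidence) proposal per block position.
Denoiser = Callable[[List[int], List[int]], List[Tuple[int, float]]]


def generate(prompt: List[int], denoise: Denoiser, num_blocks: int,
             block_size: int, num_steps: int) -> List[int]:
    """Generate num_blocks blocks, refining each one in num_steps passes."""
    context = list(prompt)
    for _ in range(num_blocks):
        block = [MASK] * block_size
        for step in range(num_steps):
            masked = [i for i, tok in enumerate(block) if tok == MASK]
            if not masked:
                break
            # One denoiser call proposes tokens for every position at once.
            proposals = denoise(context, block)
            # Commit enough positions per pass to finish within num_steps.
            remaining = num_steps - step
            k = -(-len(masked) // remaining)  # ceiling division
            masked.sort(key=lambda i: proposals[i][1], reverse=True)
            for i in masked[:k]:
                block[i] = proposals[i][0]
        # The finished block becomes clean context for the next block.
        context.extend(block)
    return context


def toy_denoise(context: List[int], block: List[int]) -> List[Tuple[int, float]]:
    """Stand-in denoiser: continues the integer sequence after the context."""
    start = context[-1] + 1
    return [(start + i, 1.0 / (i + 1)) for i in range(len(block))]


if __name__ == "__main__":
    print(generate([0], toy_denoise, num_blocks=2, block_size=4, num_steps=2))
    # prints [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Here eight tokens are produced with four denoiser calls (two blocks, two passes each), whereas a token-by-token autoregressive loop would need eight forward passes. That gap is the intuition behind the throughput gain, although the real speedup depends on the cost of each pass.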