A user tested NVIDIA's Nemotron-3-Super-120B-A12B, a hybrid Mamba and MoE model, and reported exact recall in needle-in-a-haystack tests up to 504,482 tokens. The model ran fully on GPU across four RTX 3090s using the i1-Q4_K_S quantization. The results are consistent with its Mamba layers maintaining a constant-size recurrent state rather than a growing KV cache.
- Decode speed ranged from 72 t/s at short context to 23 t/s at 504K tokens.
- Prefill speed decreased from ~2080 t/s at 30K tokens to 885 t/s at 504K tokens.
- The model maintained exact recall for buried needles at all tested depths (10%, 50%, and 90%) up to the maximum context length.
- VRAM usage was approximately 20GB per card, totaling around 71GB for the quantized model.
- In a head-to-head comparison with MiniMax-M2.7-REAP on the same hardware, Nemotron delivered roughly 2.7x faster decode speeds at equivalent context lengths while maintaining precision.
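To put the throughput figures above in proportion, the following is a small sketch that computes the slowdown ratios from the reported numbers. The prefill time range is only a rough bracket: it assumes the true average prefill rate over the full 504,482-token prompt lies somewhere between the two reported rates, which the source does not state. All names here are illustrative.

```python
# Throughput figures as reported in the summary (tokens per second).
DECODE_SHORT_TPS = 72.0
DECODE_504K_TPS = 23.0
PREFILL_30K_TPS = 2080.0   # reported as approximate
PREFILL_504K_TPS = 885.0
MAX_CONTEXT_TOKENS = 504_482


def slowdown(fast_tps: float, slow_tps: float) -> float:
    """Return how many times slower the slow rate is than the fast rate."""
    return fast_tps / slow_tps


def prefill_time_bounds(n_tokens: int, best_tps: float, worst_tps: float):
    """Bracket the prefill time in seconds.

    Assumption (not from the source): the average rate over the whole
    prompt falls between the best and worst reported rates.
    """
    return n_tokens / best_tps, n_tokens / worst_tps


decode_slowdown = slowdown(DECODE_SHORT_TPS, DECODE_504K_TPS)
prefill_slowdown = slowdown(PREFILL_30K_TPS, PREFILL_504K_TPS)
low_s, high_s = prefill_time_bounds(
    MAX_CONTEXT_TOKENS, PREFILL_30K_TPS, PREFILL_504K_TPS
)

print(f"decode slowdown, short to 504K: {decode_slowdown:.2f}x")
print(f"prefill slowdown, 30K to 504K: {prefill_slowdown:.2f}x")
print(f"prefill time for full context: {low_s / 60:.1f} to {high_s / 60:.1f} min")
```

Decode slows by a little over 3x and prefill by a little over 2x across the tested range, so context length still has a cost, but it is modest given that the context grows by more than an order of magnitude.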
The architecture supports efficient long-context processing because the Mamba layers' state does not grow with context length. Inference therefore remains fast even at half a million tokens: decode slows by roughly 3x (72 to 23 t/s) across the tested range, a milder drop than the degradation typical of full-attention models.
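The difference between a growing KV cache and a fixed recurrent state can be illustrated with a back-of-the-envelope memory calculation. This is a minimal sketch only: every dimension below (layer counts, head counts, state size, 2-byte elements) is a hypothetical placeholder and not Nemotron's actual configuration, which the source does not give.

```python
# Hypothetical dimensions for illustration only; NOT the real model config.
BYTES_PER_ELEM = 2          # assumes 16-bit cache entries
ATTN_LAYERS = 8             # hypothetical number of attention layers
KV_HEADS = 8                # hypothetical number of KV heads per layer
HEAD_DIM = 128              # hypothetical head dimension
MAMBA_LAYERS = 40           # hypothetical number of Mamba layers
STATE_ELEMS_PER_LAYER = 1_000_000  # hypothetical recurrent state size


def kv_cache_bytes(n_tokens: int) -> int:
    """KV cache size: one key and one value vector per token, head, and layer.

    Grows linearly with the number of tokens in context.
    """
    per_token = 2 * ATTN_LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return per_token * n_tokens


def recurrent_state_bytes(n_tokens: int) -> int:
    """Recurrent state size for the Mamba-style layers.

    n_tokens is accepted but deliberately unused: the state is overwritten
    in place at each step, so its size does not depend on context length.
    """
    return MAMBA_LAYERS * STATE_ELEMS_PER_LAYER * BYTES_PER_ELEM


for n in (30_000, 504_482):
    kv_gib = kv_cache_bytes(n) / 2**30
    state_gib = recurrent_state_bytes(n) / 2**30
    print(f"{n:>7} tokens: KV cache {kv_gib:.2f} GiB, recurrent state {state_gib:.2f} GiB")
```

Under these made-up numbers the attention layers' cache scales with every added token while the recurrent state stays flat, which is the property the paragraph above attributes to the Mamba layers. In a hybrid model, only the attention layers contribute the growing term.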