This article reports on an update to the Ornith-1.0-35B model, featuring a native MTP draft head grafted onto the IQ4_XS body for self-speculative decoding in llama.cpp. The author provides comprehensive performance metrics including throughput, time-to-first-token (TTFT), and long-context capabilities on a single RTX PRO 6000 Blackwell GPU.
- Single-stream decode speed increased by 1.3-1.35x; the quoted figures, 172.6 to 233.8 tokens per second, correspond to roughly 1.35x.
- The next-token distribution is byte-identical to that of the target-only model (KLD 0.0); measured against BF16, the KLD is 0.073.
- The IQ4_XS-MTP graft occupies approximately 19.6 GB, positioning it between Q5_K_M and Q4_K_M on fidelity metrics.
- Throughput scales from ~243 tok/s at concurrency 1 to ~656 tok/s at concurrency 16 for the Q4_K_M quantization.
- Long-context prefill time scales from 94 ms at 512 tokens to approximately 6.3 seconds at 32k tokens.
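
The ratios implied by the figures above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in the list. It assumes that "32k" means 32768 tokens and that the ~656 tok/s figure is aggregate throughput across all 16 streams; neither is stated explicitly in the summary.

```python
# Figures quoted in the summary; nothing here is newly measured.
baseline_tps = 172.6  # target-only single-stream decode, tok/s
mtp_tps = 233.8       # decode with the MTP draft head, tok/s

# Single-stream speedup from self-speculative decoding.
speedup = mtp_tps / baseline_tps

# Q4_K_M throughput at concurrency 1 and 16 (approximate values).
tps_c1, tps_c16 = 243.0, 656.0
scaling = tps_c16 / tps_c1
# ASSUMPTION: ~656 tok/s is the aggregate over all 16 streams.
per_stream_c16 = tps_c16 / 16

# Prefill rate implied by the reported prefill times.
# ASSUMPTION: "32k" is taken as 32768 tokens.
prefill_rate_512 = 512 / 0.094   # tokens per second at 512 tokens
prefill_rate_32k = 32768 / 6.3   # tokens per second at 32k tokens

print(f"decode speedup:        {speedup:.3f}x")
print(f"concurrency scaling:   {scaling:.2f}x from 1 to 16 streams")
print(f"per-stream at c=16:    {per_stream_c16:.1f} tok/s")
print(f"prefill rate at 512:   {prefill_rate_512:.0f} tok/s")
print(f"prefill rate at 32k:   {prefill_rate_32k:.0f} tok/s")
```

Under these assumptions, the 16x increase in concurrency yields well under 16x the throughput, and the implied prefill rate at 32k tokens is only slightly lower than at 512 tokens.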
The update lets users benchmark and run a self-speculative decoding variant that decodes roughly 1.3-1.35x faster in single-stream use while maintaining high fidelity relative to larger, more memory-intensive quantizations.
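
For readers unfamiliar with the KLD figures quoted above, the following is a minimal sketch of how a KL divergence between two next-token distributions can be computed from raw logits. The function names and the toy logits are hypothetical and chosen for illustration only; this is not the measurement procedure used in the article, which is not described in this summary.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_logits, q_logits):
    """KL(P || Q) in nats, where P is the reference distribution."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy logits over a 4-token vocabulary (illustrative values, not real data).
reference = [2.0, 1.0, 0.5, -1.0]
perturbed = [1.8, 1.1, 0.4, -0.9]

same = kl_divergence(reference, reference)
diff = kl_divergence(reference, perturbed)
print(f"identical logits: KLD = {same}")
print(f"perturbed logits: KLD = {diff:.6f}")
```

Identical logits give a divergence of exactly zero, which is what a KLD of 0.0 against the target-only model indicates: the draft head changes how quickly tokens are produced, not which distribution they are drawn from.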