Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)
This post covers an update to the Ornith-1.0-35B GGUF release: a native MTP (multi-token prediction) draft head is grafted onto the IQ4_XS body, so the model can run self-speculative decoding in llama.cpp without a separate draft model. The author reports serving throughput, time-to-first-token (TTFT), and long-context measurements, all taken on a single RTX PRO 6000 Blackwell GPU (tp=1).
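For readers new to the technique, here is a minimal sketch of one greedy speculative-decoding step: a cheap draft proposes a few tokens, the main model checks them, and the longest agreeing prefix is kept. This is an illustration only, not llama.cpp's implementation and not the author's code. The names `speculative_step`, `draft_next`, `target_next`, and `n_draft` are hypothetical, and the two models are stand-in callables that return the next token for a given context.

```python
from typing import Callable, List, Sequence

Token = int
NextFn = Callable[[Sequence[Token]], Token]


def speculative_step(
    context: Sequence[Token],
    draft_next: NextFn,   # hypothetical stand-in for the cheap draft head
    target_next: NextFn,  # hypothetical stand-in for the main model body
    n_draft: int,
) -> List[Token]:
    """Run one greedy speculative step and return the tokens accepted in it."""
    # 1. The draft head proposes n_draft tokens autoregressively.
    drafted: List[Token] = []
    for _ in range(n_draft):
        drafted.append(draft_next(list(context) + drafted))

    # 2. The main model verifies the proposals. A real engine would score
    #    all positions in one batched forward pass; this sketch calls the
    #    model once per position to keep the logic readable.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(list(context) + accepted)
        if tok != expected:
            # First disagreement: keep the main model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so the main model contributes one extra token.
    accepted.append(target_next(list(context) + accepted))
    return accepted
```

Under greedy decoding the accepted tokens are exactly what the main model would have produced on its own, so any speedup comes from verifying several positions per main-model pass, and it depends on how often the draft agrees with the main model.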