A user reports achieving a 30-40% increase in token generation speed by pairing the Ornith-1.0-35B model as a draft model with Qwen3.6-35B-A3B-DFlash using llama-server.

  • The configuration uses Ornith-1.0-35B-GGUF (Q8_0) as the speculative draft model via the `--spec-type draft-dflash` flag.
  • A test on a 50k-token context of mixed JavaScript code and Wikipedia text yielded an 80% draft-token acceptance rate.
  • The setup involves running llama-server with specific parameters for context length, temperature, and draft steps.

The combination speeds up generation but significantly slows prompt processing, so it is not a universal solution: it pays off mainly when a request generates many tokens relative to the length of its prompt.
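
To make the trade-off concrete, here is a minimal sketch of a break-even estimate, assuming end-to-end latency is simply prompt-processing time plus generation time. Every throughput figure in it is hypothetical and chosen only for illustration: the 1.35x generation speedup is the midpoint of the reported 30-40% range, and the 2x prompt-processing slowdown is an assumption, since the report says only that prompt processing is "significantly slower". The function names are likewise made up for this example.

```python
def total_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """End-to-end latency: prompt-processing time plus generation time."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


def break_even_gen_tokens(prompt_tokens, pp_base, pp_spec, tg_base, tg_spec):
    """Number of generated tokens beyond which the speculative setup wins.

    pp_* are prompt-processing speeds and tg_* are generation speeds,
    all in tokens per second, for the baseline and speculative setups.
    """
    # Extra time the speculative setup spends reading the prompt.
    extra_prompt_time = prompt_tokens / pp_spec - prompt_tokens / pp_base
    # Time the speculative setup saves on each generated token.
    saved_per_token = 1.0 / tg_base - 1.0 / tg_spec
    if saved_per_token <= 0:
        return float("inf")  # generation is not faster, so it never pays off
    return max(0.0, extra_prompt_time / saved_per_token)


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the report.
    PROMPT_TOKENS = 50_000           # matches the 50k test context
    PP_BASE, PP_SPEC = 1000.0, 500.0  # assumed 2x prompt-processing slowdown
    TG_BASE = 40.0                    # assumed baseline generation speed
    TG_SPEC = TG_BASE * 1.35          # midpoint of the reported 30-40% gain

    n = break_even_gen_tokens(PROMPT_TOKENS, PP_BASE, PP_SPEC, TG_BASE, TG_SPEC)
    print(f"break-even at about {n:.0f} generated tokens")
```

Under these assumed numbers, a request with a 50k-token prompt would need several thousand generated tokens before the faster generation offsets the slower prompt processing. With a short prompt, or with the prompt already cached, the break-even point drops toward zero.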