A user has created a modified version of Ornith 35B FP8 E4M3, a local agentic coding model, by grafting on Multi-Token Prediction (MTP) drafter support. The stock model does not offer this in vLLM out of the box.

  • The grafting process adds MTP capabilities to the existing model architecture.
  • Benchmarks show an 18% speedup over running the same model without MTP.
  • The drafter's average acceptance rate is 70%.
  • The modified model supports the full 256k context window on RTX setups with more than 80GB of VRAM.
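
To put the 70% acceptance rate and the 18% speedup side by side, here is a rough back-of-the-envelope sketch. It is not from the original post. It assumes the 70% figure is a per-token acceptance probability, that each draft token is accepted independently, and that the drafter proposes one token per step. The draft length is a hypothetical value, since the source does not state it, and the function names are illustrative only.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes every draft token is accepted independently with the same
    probability and that drafting stops at the first rejection. The
    target model always contributes one token of its own, so the
    result is at least 1.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        return float(num_draft_tokens + 1)
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)


def implied_step_cost(tokens_per_step: float, observed_speedup: float) -> float:
    """Cost of one speculative step relative to one plain decoding step.

    If a step yields `tokens_per_step` tokens but end-to-end speed only
    improves by `observed_speedup`, each step must cost this many times
    a baseline step.
    """
    if observed_speedup <= 0.0:
        raise ValueError("observed_speedup must be positive")
    return tokens_per_step / observed_speedup


# Figures from the summary above; the draft length of 1 is an assumption.
tokens = expected_tokens_per_step(acceptance_rate=0.70, num_draft_tokens=1)
cost = implied_step_cost(tokens, observed_speedup=1.18)
print(f"tokens per step: {tokens:.2f}, implied relative step cost: {cost:.2f}")
```

Under these assumptions each step yields about 1.7 tokens, so an 18% net gain would mean a speculative step costs roughly 1.4 times a plain decoding step. A different draft length or a different definition of acceptance rate would change that estimate.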

This modification provides a performance-optimized inference option for users running Ornith 35B on high-end local hardware.