The author quantized the deepreinforce-ai/Ornith-1.0-35B model to Q3_K_M, producing a build that needs approximately 17 GB of VRAM, and used KL divergence checks against the BF16 baseline to verify that its behavior stays close to the original.

  • The Q3_K_M quantization reduces bits per weight from 16.01 to 3.87, resulting in a 16.8 GB file that is about 21% smaller than the Q4_K_M variant.
  • Validation against the BF16 baseline shows a mean KL divergence (KLD) of 0.366 and an 84.4% top-1 token match rate; for comparison, the top-1 match rates are 100% for Q6_K and 96.9% for Q8_0.
  • Throughput on a single GPU reaches ~240 tokens per second in single-stream mode and scales to ~493 tokens per second across 16 concurrent slots.
  • The author fixed a reasoning-mode serving bug in which short coding requests returned empty final content; the serving scripts now default to REASONING=off.
  • A corrected top-64 next-token KL probe was used for validation, and upstream Q4/Q5/Q6/Q8 models were mirrored and revalidated within the same repository.
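
The summary does not spell out how the top-64 next-token KL probe is defined, so the following is only a minimal sketch of one plausible formulation: take the reference (BF16) model's 64 most likely next tokens, renormalize both models' distributions over that set, and compute KL(reference || quantized), alongside the top-1 match rate. The function names (`topk_kl`, `top1_match`, `probe`) and the renormalization choice are assumptions for illustration, not the author's implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def topk_kl(ref_logits, quant_logits, k=64):
    """KL(ref || quant) restricted to the reference model's top-k tokens.

    Assumption: both distributions are renormalized over the same k tokens.
    """
    k = min(k, len(ref_logits))
    top = sorted(range(len(ref_logits)),
                 key=lambda i: ref_logits[i], reverse=True)[:k]
    ref_lp = log_softmax([ref_logits[i] for i in top])
    quant_lp = log_softmax([quant_logits[i] for i in top])
    return sum(math.exp(p) * (p - q) for p, q in zip(ref_lp, quant_lp))

def top1_match(ref_logits, quant_logits):
    """True when both models pick the same most likely next token."""
    def argmax(xs):
        return max(range(len(xs)), key=xs.__getitem__)
    return argmax(ref_logits) == argmax(quant_logits)

def probe(pairs, k=64):
    """Return (mean KL, top-1 match rate) over (ref, quant) logit pairs,
    one pair per probed position."""
    kls, hits = [], 0
    for ref, quant in pairs:
        kls.append(topk_kl(ref, quant, k))
        hits += top1_match(ref, quant)
    return sum(kls) / len(kls), hits / len(kls)
```

In this formulation a mean KL of 0 means the quantized model reproduces the reference distribution exactly over the probed tokens, and larger values mean greater drift. The top-1 match rate only registers the cases where the most likely token changes.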

At roughly 17 GB, the Q3_K_M build lets the 35B-parameter model run on a single GPU with far less memory than the higher-precision variants, and the reported KLD, top-1 match, and throughput figures let users judge the quality trade-off for themselves.
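
The size and throughput figures can be cross-checked with simple arithmetic. The sketch below assumes that "GB" means decimal gigabytes (10^9 bytes) and that the ~493 tokens per second figure is the aggregate across all 16 slots; neither is stated in the summary. The variable names are illustrative.

```python
GB = 1e9  # assumption: sizes are reported in decimal gigabytes

# Figures taken from the summary.
q3_size_gb = 16.8
q3_bpw = 3.87              # bits per weight after Q3_K_M quantization
bf16_bpw = 16.01           # bits per weight of the BF16 baseline
q3_vs_q4_reduction = 0.21  # Q3_K_M is about 21% smaller than Q4_K_M
single_stream_tps = 240
aggregate_tps = 493        # assumption: total across all slots
slots = 16

def implied_params(size_gb, bpw):
    """Weight count implied by a file size and an average bits-per-weight."""
    return size_gb * GB * 8 / bpw

def size_at_bpw(params, bpw):
    """File size in GB for a weight count at a given bits-per-weight."""
    return params * bpw / 8 / GB

params = implied_params(q3_size_gb, q3_bpw)
bf16_size_gb = size_at_bpw(params, bf16_bpw)
q4_size_gb = q3_size_gb / (1 - q3_vs_q4_reduction)
per_slot_tps = aggregate_tps / slots
speedup = aggregate_tps / single_stream_tps

print(f"implied weights:   {params / 1e9:.1f} B")
print(f"implied BF16 size: {bf16_size_gb:.1f} GB")
print(f"implied Q4_K_M:    {q4_size_gb:.1f} GB")
print(f"per-slot speed:    {per_slot_tps:.1f} tok/s")
print(f"aggregate speedup: {speedup:.2f}x")
```

Under these assumptions the numbers are self-consistent: 16.8 GB at 3.87 bits per weight implies about 34.7 billion weights, in line with a 35B model, and a BF16 file of about 69.5 GB. The 21% reduction puts the Q4_K_M file at about 21.3 GB. With 16 slots, each stream runs at about 31 tokens per second while aggregate throughput is about 2.05 times the single-stream rate.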