A developer has released an inference engine written in pure C for Qwen 3 models of 4B parameters and below. The project is available on GitHub as a learning resource that favors code readability and educational value over raw performance.

  • Written from scratch in pure C with no external dependencies other than libc, libm, cJSON, and optional OpenMP.
  • Loads Hugging Face safetensors files directly and applies 4-bit affine quantization on the fly, with no separate weight-conversion step.
  • Implements KV caching and includes a built-in terminal-based chat interface.
  • Generates roughly 1 token per second on an i5-1240P laptop, a consequence of choosing clarity over speed.
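
For readers unfamiliar with the term, 4-bit affine quantization can be sketched roughly as follows. This is an illustration, not code from the project: the group size of 32, the struct layout, and the names `q4_group`, `q4_quantize_group`, and `q4_dequantize` are all assumptions. Each group of weights is mapped onto the integer codes 0..15 using a per-group scale and minimum, and two codes are packed into each byte.

```c
#include <assert.h>
#include <stdint.h>

#define GROUP_SIZE 32 /* assumed group size, not taken from the project */

/* One quantized group: value ~= min + scale * code, with code in 0..15. */
typedef struct {
    float scale;                    /* step between adjacent codes */
    float min;                      /* value represented by code 0 */
    uint8_t packed[GROUP_SIZE / 2]; /* two 4-bit codes per byte */
} q4_group;

/* Quantize GROUP_SIZE floats into one group (hypothetical helper). */
static void q4_quantize_group(const float *w, q4_group *g) {
    float lo = w[0], hi = w[0];
    for (int i = 1; i < GROUP_SIZE; i++) {
        if (w[i] < lo) lo = w[i];
        if (w[i] > hi) hi = w[i];
    }
    g->min = lo;
    g->scale = (hi - lo) / 15.0f;
    /* A constant group has scale 0; every code is then 0. */
    float inv = g->scale > 0.0f ? 1.0f / g->scale : 0.0f;
    for (int i = 0; i < GROUP_SIZE; i += 2) {
        /* Inputs are non-negative, so adding 0.5 and truncating rounds to nearest. */
        int a = (int)((w[i] - lo) * inv + 0.5f);
        int b = (int)((w[i + 1] - lo) * inv + 0.5f);
        if (a > 15) a = 15;
        if (b > 15) b = 15;
        /* Even index goes in the low nibble, odd index in the high nibble. */
        g->packed[i / 2] = (uint8_t)(a | (b << 4));
    }
}

/* Recover an approximation of element i of the group. */
static float q4_dequantize(const q4_group *g, int i) {
    assert(i >= 0 && i < GROUP_SIZE);
    uint8_t byte = g->packed[i / 2];
    int code = (i & 1) ? (byte >> 4) : (byte & 0x0F);
    return g->min + g->scale * (float)code;
}
```

With this layout, 32 weights occupy 16 bytes of codes plus two floats of metadata, and the reconstruction error per weight is bounded by about half of `scale`. A real engine would typically dequantize inside the matrix-vector product rather than one element at a time.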

The engine serves as an educational tool for understanding transformer architecture and inference mechanics, offering a tractable alternative to dense, high-performance implementations.
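
As one example of the inference mechanics involved, the storage side of a KV cache can be sketched in a few lines of C. This is a simplified illustration under stated assumptions (a single attention head, and the hypothetical names `kv_cache`, `kv_init`, `kv_append`, `kv_scores`, and `kv_free`), not the project's actual code. The idea is that each generated token appends one key row and one value row, so earlier tokens never need to be recomputed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Single-head cache: k and v each hold up to max_seq rows of dim floats. */
typedef struct {
    int max_seq; /* capacity in tokens */
    int dim;     /* head dimension */
    int len;     /* tokens cached so far */
    float *k;
    float *v;
} kv_cache;

/* Allocate the cache; returns 0 on success, -1 on allocation failure. */
static int kv_init(kv_cache *c, int max_seq, int dim) {
    c->max_seq = max_seq;
    c->dim = dim;
    c->len = 0;
    c->k = calloc((size_t)max_seq * (size_t)dim, sizeof(float));
    c->v = calloc((size_t)max_seq * (size_t)dim, sizeof(float));
    if (!c->k || !c->v) {
        free(c->k);
        free(c->v);
        c->k = c->v = NULL;
        return -1;
    }
    return 0;
}

static void kv_free(kv_cache *c) {
    free(c->k);
    free(c->v);
    c->k = c->v = NULL;
    c->len = 0;
}

/* Store the new token's key and value; returns its position, or -1 if full. */
static int kv_append(kv_cache *c, const float *k, const float *v) {
    if (c->len >= c->max_seq) return -1;
    size_t off = (size_t)c->len * (size_t)c->dim;
    memcpy(c->k + off, k, (size_t)c->dim * sizeof(float));
    memcpy(c->v + off, v, (size_t)c->dim * sizeof(float));
    return c->len++;
}

/* Raw attention scores: dot product of the query with every cached key. */
static void kv_scores(const kv_cache *c, const float *q, float *scores) {
    for (int t = 0; t < c->len; t++) {
        const float *kt = c->k + (size_t)t * (size_t)c->dim;
        float dot = 0.0f;
        for (int d = 0; d < c->dim; d++) dot += q[d] * kt[d];
        scores[t] = dot;
    }
}
```

A full attention step would then scale these scores, apply a softmax, and take the weighted sum of the cached value rows; those steps are left out here to keep the focus on the cache itself.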