A developer has released an inference engine written in pure C for Qwen 3 models of 4B parameters and below. The project is available on GitHub as a learning resource that prioritizes code readability over raw performance.
- Written from scratch in pure C; the only dependencies are libc, libm, cJSON, and, optionally, OpenMP.
- Loads Hugging Face safetensors files directly and applies 4-bit affine quantization on the fly, with no separate weight-conversion step.
- Implements KV caching and includes a built-in terminal-based chat interface.
- Runs at approximately 1 token per second on an i5-1240P laptop, reflecting the choice of clarity over speed.
The engine serves as an educational tool for understanding transformer architecture and inference mechanics, offering a tractable alternative to dense, high-performance implementations.