Researchers present BaseRT, a native Metal inference runtime for large language models on Apple Silicon that achieves the highest reported inference throughput to date. By utilizing chip-specific kernel fusion and unified memory-aware optimization, it overcomes the overhead found in existing frameworks like llama.cpp and MLX.
- Supports eight quantization formats (Q2 to FP16) across all Apple M-series devices.
- Achieves up to 1.56x higher decode throughput than llama.cpp and 1.35x higher than MLX on M3 and M4 Pro devices.
- Shows substantially larger margins on prefill for mixture-of-experts models.
- Maintains consistent best-in-class throughput for models ranging from sub-1B to 30B parameters.
The authors argue that performance-optimized local runtimes are critical for the emerging edge inference paradigm, helping address privacy requirements, latency constraints, and cloud cost pressures.