Researchers present BaseRT, a native Metal inference runtime for large language models on Apple Silicon that achieves the highest reported inference throughput to date. By utilizing chip-specific kernel fusion and unified memory-aware optimization, it overcomes the overhead found in existing frameworks like llama.cpp and MLX.

  • Supports eight quantization formats (Q2 to FP16) across all Apple M-series devices.
  • Achieves up to 1.56x higher decode throughput than llama.cpp and 1.35x higher than MLX on M3 and M4 Pro devices.
  • Shows substantially larger margins on prefill for mixture-of-experts models.
  • Maintains consistent best-in-class throughput for models ranging from sub-1B to 30B parameters.

The authors argue that performance-optimized local runtimes are critical for the emerging edge inference paradigm, helping address privacy requirements, latency constraints, and cloud cost pressures.