A developer has open-sourced the code for an MLX-based inference kernel designed to run the Gemma 12B model locally on consumer hardware, specifically targeting M-series MacBooks.
The project is built around the constraints of a 16GB MacBook Pro and aims to bridge the gap between MLX and CUDA libraries for local model development. The author attempted to integrate DSpark, but the drafter model's memory requirements pushed the setup past the 16GB limit; quantizing the drafter or training a smaller one is left as future work.
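To make the memory constraint concrete, here is a rough back-of-the-envelope sketch of how weight precision determines whether a 12B target model plus a drafter fit in 16GB. The drafter size, the usable-memory fraction, and the bit widths below are illustrative assumptions, not figures from the project, and the estimate ignores the KV cache and activations entirely.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes); ignores KV cache and activations."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(budget_gb: float, *footprints_gb: float) -> bool:
    """True if the combined footprints stay within the memory budget."""
    return sum(footprints_gb) <= budget_gb


TARGET_PARAMS = 12e9     # 12B target model, as named in the article
DRAFTER_PARAMS = 2e9     # hypothetical drafter size; the article does not state it
USABLE_FRACTION = 0.75   # assumption: share of the 16GB actually available to the model
BUDGET_GB = 16 * USABLE_FRACTION

if __name__ == "__main__":
    for bits in (16, 8, 4):
        target = weight_footprint_gb(TARGET_PARAMS, bits)
        drafter = weight_footprint_gb(DRAFTER_PARAMS, bits)
        verdict = "fits" if fits(BUDGET_GB, target, drafter) else "does not fit"
        print(f"{bits:>2}-bit: target {target:.1f} GB + drafter {drafter:.1f} GB -> {verdict}")
```

Under these assumed numbers only the most aggressive quantization leaves headroom, which is consistent with the author's suggestion to quantize the drafter or train a smaller one.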
The current focus is on finalizing native graph integration and validating Multi-Token Prediction (MTP). Theoretical throughput is capped at 20-30 tokens per second by memory bandwidth. The code is provided as an experimental learning resource rather than a productized solution, though the author plans to use it as a baseline for optimizing Gemma models on NVIDIA hardware.
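The bandwidth ceiling can be illustrated with a simple estimate: during single-stream decoding, each generated token requires reading roughly all model weights once, so the upper bound on tokens per second is approximately memory bandwidth divided by weight size. The sketch below assumes 4-bit weights (about 6 GB for 12B parameters) and a range of bandwidth values chosen for illustration; neither figure comes from the project.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode throughput when each token reads all weights once.

    Ignores KV-cache reads, compute time, and any multi-token techniques.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Assumption: 12B parameters at 4 bits per weight -> 12e9 * 4 / 8 bytes = 6 GB.
WEIGHTS_GB = 12e9 * 4 / 8 / 1e9

# Assumption: illustrative unified-memory bandwidths in GB/s, not measured values.
ASSUMED_BANDWIDTHS_GB_S = (120.0, 150.0, 180.0)

if __name__ == "__main__":
    for bw in ASSUMED_BANDWIDTHS_GB_S:
        bound = decode_tokens_per_second(bw, WEIGHTS_GB)
        print(f"{bw:.0f} GB/s -> at most {bound:.0f} tokens/s")
```

With these assumed inputs the bound lands in the 20-30 tokens per second range quoted above. Real throughput would sit below the bound, since the estimate leaves out KV-cache traffic and compute overhead.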