Gemma-4-12b audio inference on MacBook M2 Max reaches 16.8 tok/s

A user benchmarks the Gemma-4-12b model with audio input on a MacBook M2 Max with 64 GB of RAM, measuring 16.8 tokens per second on the first inference.

The setup is a Tauri 2 desktop app that calls llama.cpp through native Rust FFI using the llama-cpp-2 crate, with Metal acceleration enabled. The model is gemma-4-12b-it-Q5_K_S, a quantization published by Unsloth. The audio input is a 607 KB WAV file (16-bit mono PCM at 16 kHz), passed to the model through mtmd, llama.cpp's multimodal library, which inserts the audio at a media marker in the prompt.
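For a sense of scale, the clip length can be estimated from the file size and the stated PCM format. The sketch below is an illustration, not part of the user's setup; it assumes that "607 KB" means 607,000 bytes and that the file carries a standard 44-byte WAV header, neither of which the post confirms.

```python
def pcm_duration_seconds(file_bytes, sample_rate_hz=16_000,
                         bits_per_sample=16, channels=1, header_bytes=44):
    """Estimate the duration of an uncompressed PCM WAV file from its size.

    header_bytes=44 is an assumption (the canonical minimal WAV header);
    real files may carry extra chunks that make the header larger.
    """
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return (file_bytes - header_bytes) / bytes_per_second


# Assumption: "607 KB" is read as 607 * 1000 bytes.
duration = pcm_duration_seconds(607 * 1000)
print(f"{duration:.1f} s of audio")
```

Under these assumptions the clip is roughly 19 seconds long. Reading "KB" as 1024 bytes instead shifts the estimate by less than half a second.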

The 16.8 tok/s figure covers the full path: 2 seconds of audio prefill plus 3.7 seconds of decoding. Decode alone reaches 26 tok/s. The user asks whether this performance is reasonable and how inference could be made faster.
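The reported figures are consistent with each other, which a short calculation can show. This is a sketch that assumes the 16.8 tok/s number is generated tokens divided by prefill time plus decode time; the post does not say exactly how it was measured.

```python
prefill_s = 2.0        # audio prefill time reported in the post
decode_s = 3.7         # decode time reported in the post
end_to_end_tps = 16.8  # overall tokens per second reported in the post

total_s = prefill_s + decode_s

# Assumption: end-to-end rate = generated tokens / (prefill + decode time).
generated_tokens = end_to_end_tps * total_s
decode_tps = generated_tokens / decode_s
prefill_share = prefill_s / total_s

print(f"total time:       {total_s:.1f} s")
print(f"generated tokens: {generated_tokens:.0f}")
print(f"decode-only rate: {decode_tps:.1f} tok/s")
print(f"prefill share:    {prefill_share:.0%}")
```

On these numbers the run produced about 96 tokens, the decode-only rate works out to about 26 tok/s as reported, and audio prefill accounts for roughly a third of the wall-clock time. For short responses like this one, the prefill cost therefore weighs heavily on the end-to-end figure.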