Running MiniMax M2.7 Q3_K_XL on 6x NVIDIA Tesla P40 GPUs

A user details the successful deployment of the MiniMax M2.7 Q3_K_XL model across six NVIDIA Tesla P40 GPUs, providing a complete hardware configuration and optimized inference settings for local LLM hosting.

The hardware setup consists of an Asus X99-E-WS motherboard with a modded BIOS, an Intel Xeon E5-2680 v4 CPU, 128GB of DDR4 RAM, and six P40 GPUs (24GB each, 144GB of VRAM in total), each connected over a PCIe Gen3 x8 link.
Benchmarks show that using F16 KV cache with Flash Attention enabled yields the best performance, achieving 105.91 tokens per second for prompt processing at a 32k context size.
The best-performing configuration uses layer split mode with an equal distribution across the six cards (1/1/1/1/1/1), a batch size of 2048, and a ubatch size of 256. Tensor splitting caused crashes, and a Q8 KV cache proved slower than F16.
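
The settings above can be collected into a single launch command. The sketch below is illustrative only: it assumes the runtime is llama.cpp's `llama-server` (the summary does not name the tool, although the option names match llama.cpp terminology), the model path is a hypothetical placeholder, and offloading all layers with `--n-gpu-layers 99` is an assumption rather than something the source reports. Note that `llama-server` takes the split ratios as a comma-separated list, whereas the slash form 1/1/1/1/1/1 is how `llama-bench` writes them.

```python
import shlex

NUM_GPUS = 6
VRAM_PER_GPU_GB = 24  # Tesla P40; 6 x 24 = 144 GB total


def build_args(model_path, num_gpus=NUM_GPUS, ctx=32768, batch=2048, ubatch=256):
    """Return an argv list mirroring the reported settings (assumes llama.cpp)."""
    # Equal share of layers per GPU, e.g. "1,1,1,1,1,1" for six cards.
    split = ",".join(["1"] * num_gpus)
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", "99",        # assumption: offload every layer to the GPUs
        "--split-mode", "layer",       # layer split, as reported
        "--tensor-split", split,
        "--ctx-size", str(ctx),        # 32k context
        "--batch-size", str(batch),
        "--ubatch-size", str(ubatch),
        "--cache-type-k", "f16",       # F16 KV cache was faster than Q8
        "--cache-type-v", "f16",
        # Recent llama.cpp builds take on/off/auto here; older builds use a
        # bare --flash-attn flag with no value.
        "--flash-attn", "on",
    ]


# Hypothetical model path; substitute the actual GGUF file.
print(shlex.join(build_args("/path/to/model.gguf")))
```

Building the argument list in code rather than by hand makes it easy to vary one setting at a time (for example the ubatch size or the KV cache type) when repeating the comparison on different hardware.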

This write-up offers a practical reference for users who want to run large models locally on older hardware with limited VRAM per card by splitting the model across multiple GPUs.