A user details the successful deployment of the MiniMax M2.7 Q3_K_XL model across six NVIDIA Tesla P40 GPUs, providing a complete hardware configuration and optimized inference settings for local LLM hosting.
- The hardware setup consists of an Asus X99-E-WS motherboard with a modded BIOS, an Intel Xeon E5-2680 v4 CPU, 128GB of DDR4 RAM, and six P40 GPUs (24GB each, 144GB of VRAM in total), each connected over a PCIe Gen3 x8 link.
- Benchmarks show that using F16 KV cache with Flash Attention enabled yields the best performance, achieving 105.91 tokens per second for prompt processing at a 32k context size.
- The best-performing configuration uses layer split mode with the model distributed equally across the six cards (1/1/1/1/1/1), a batch size of 2048, and a ubatch size of 256. Tensor splitting caused crashes, and a Q8 KV cache proved slower than F16.
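
The settings above can be sketched as a command line. This is a minimal illustration that assumes the runtime is llama.cpp's `llama-server`, which the summary does not name. The flag names are llama.cpp's, but the model path, the `--n-gpu-layers` value, and the reading of "32k" as 32768 tokens are placeholders and assumptions, not values taken from the original post.

```python
import shlex


def build_server_args(model_path, n_gpus=6, ctx_size=32768,
                      batch=2048, ubatch=256, kv_type="f16"):
    """Build an argument list for llama-server (assumed runtime)."""
    # Equal share per GPU: "1/1/1/1/1/1" for six cards.
    tensor_split = "/".join(["1"] * n_gpus)
    return [
        "llama-server",
        "--model", model_path,
        # Assumption: a large value offloads every layer to the GPUs.
        "--n-gpu-layers", "999",
        # Layer split places whole layers on each GPU.
        "--split-mode", "layer",
        "--tensor-split", tensor_split,
        "--ctx-size", str(ctx_size),
        "--batch-size", str(batch),
        "--ubatch-size", str(ubatch),
        # Recent builds take on/off/auto; older builds use a bare flag.
        "--flash-attn", "on",
        # F16 KV cache, reported as faster than Q8 on this setup.
        "--cache-type-k", kv_type,
        "--cache-type-v", kv_type,
    ]


if __name__ == "__main__":
    # Hypothetical model path, for illustration only.
    print(shlex.join(build_server_args("/path/to/model.gguf")))
```

Keeping the settings in a function makes it easy to vary one parameter at a time, for example passing `kv_type="q8_0"` to rerun the KV cache comparison.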
This write-up offers a practical reference for users who want to run large-parameter models on older hardware with limited VRAM per card by splitting the model across multiple GPUs.