Home GPU Cluster Specs for MiniMax M3

A Reddit user details the hardware configuration used to run the MiniMax M3 model with AWQ INT4 quantization via vLLM. The setup achieves approximately 30 tokens per second for a single stream and 960 tokens per second in aggregate when batching requests.

2x RTX Pro 6000 Max-Q (96GB each), 8x RTX 3090 (24GB each), and 2x RTX 5090 (32GB each) provide 448GB of VRAM in total.
Processing is handled by a Threadripper 9960X with 128GB of DDR5 RDIMM RAM across four channels.
The system uses pipeline parallelism across tensor-parallel groups of two GPUs to spread the model over the cards.
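
The figures above can be checked with a little arithmetic. The Python sketch below totals the VRAM from the listed cards and shows one plausible way the twelve GPUs could be split into tensor-parallel pairs, each pair acting as one pipeline stage. The post does not say which cards are paired or whether every card takes part, so the pairing rule (identical cards only, all cards used) and the names `GpuModel`, `total_vram_gb`, and `pipeline_stages` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuModel:
    name: str
    count: int
    vram_gb: int


# GPU inventory as listed in the post.
INVENTORY = [
    GpuModel("RTX Pro 6000 Max-Q", 2, 96),
    GpuModel("RTX 3090", 8, 24),
    GpuModel("RTX 5090", 2, 32),
]

TP_SIZE = 2  # tensor-parallel group size stated in the post


def total_vram_gb(inventory):
    """Sum VRAM over every card in the inventory."""
    return sum(g.count * g.vram_gb for g in inventory)


def pipeline_stages(inventory, tp_size=TP_SIZE):
    """Return (gpu_name, stage_vram_gb) for each pipeline stage.

    Assumption (not confirmed by the post): every card is used and each
    tensor-parallel group contains only identical cards.
    """
    stages = []
    for g in inventory:
        if g.count % tp_size != 0:
            raise ValueError(f"{g.name}: count not divisible by TP size")
        for _ in range(g.count // tp_size):
            stages.append((g.name, tp_size * g.vram_gb))
    return stages


if __name__ == "__main__":
    print("Total VRAM (GB):", total_vram_gb(INVENTORY))
    for i, (name, vram) in enumerate(pipeline_stages(INVENTORY)):
        print(f"stage {i}: {TP_SIZE}x {name}, {vram} GB")
```

Under these assumptions the layout comes out to six pipeline stages with uneven capacity (192GB, four stages of 48GB, and 64GB), which suggests the model's layers would have to be distributed unevenly across stages rather than split equally.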

The user notes that a context of one million tokens is possible for a single user, but they are aiming for four concurrent streams despite the high power consumption and cost.