A Reddit user details the hardware configuration used to run the MiniMax M3 model with AWQ-INT4 quantization via vLLM. The setup achieves approximately 30 tokens per second for a single stream and 960 tokens per second in aggregate when batching.
- 2x RTX Pro 6000 Max-Q (96GB each), 8x RTX 3090 (24GB each), and 2x RTX 5090 (32GB each) provide a combined 448GB of VRAM across 12 GPUs.
- The CPU is a Threadripper 9960X with 128GB of DDR5 SDIMM RAM across four channels.
- The system uses pipeline parallelism across tensor-parallel groups of two GPUs to spread the model over the cards.
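The figures in the list above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the poster's actual launch script. It assumes that all 12 GPUs join a single vLLM instance and that each tensor-parallel group pairs two identical cards, which would give six pipeline stages. The model path is a hypothetical placeholder. The two parallelism flags are standard `vllm serve` options, but the poster's real command line is not given in the source.

```python
import shlex

# GPU inventory from the post: (name, card count, VRAM per card in GB).
GPUS = [
    ("RTX Pro 6000 Max-Q", 2, 96),
    ("RTX 3090", 8, 24),
    ("RTX 5090", 2, 32),
]

TENSOR_PARALLEL_SIZE = 2  # "tensor parallel groups of 2", as stated in the post

total_gpus = sum(count for _, count, _ in GPUS)
total_vram_gb = sum(count * vram for _, count, vram in GPUS)

# Assumption: every GPU is used by one vLLM instance, so the number of
# pipeline stages is the GPU count divided by the tensor-parallel size.
pipeline_parallel_size = total_gpus // TENSOR_PARALLEL_SIZE

# Assumption: each tensor-parallel group pairs two identical cards.
groups_per_model = {name: count // TENSOR_PARALLEL_SIZE for name, count, _ in GPUS}

# Hypothetical placeholder; the source does not name the checkpoint path.
MODEL = "path/to/minimax-awq-int4"

command = [
    "vllm", "serve", MODEL,
    "--tensor-parallel-size", str(TENSOR_PARALLEL_SIZE),
    "--pipeline-parallel-size", str(pipeline_parallel_size),
]

print(f"{total_gpus} GPUs, {total_vram_gb} GB VRAM")
print(f"tensor-parallel groups per card model: {groups_per_model}")
print(shlex.join(command))
```

Because every card count is even, the cards divide cleanly into pairs of the same model: one pair of RTX Pro 6000s, four pairs of RTX 3090s, and one pair of RTX 5090s. The stages would still have unequal memory, so a real deployment would likely need an uneven layer split across them. The post does not describe how that was handled.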
The user notes that a context of one million tokens is possible for a single user, but they are aiming for four concurrent streams instead, despite the high power consumption and cost.