A Reddit user is asking the community how well large language models perform on systems with four to eight NVIDIA RTX 6000 PRO GPUs. The question is aimed at users who have between 384GB and 768GB of VRAM available for running models such as GLM 5.2, Kimi 2.7, and DeepSeek V4 Pro. The poster notes that these models can technically run at 4-bit quantization but may not fit in memory at 8-bit precision. They reference a benchmark repository but point out that it lacks data for the most recent model releases. A key concern is whether the quality loss from 4-bit quantization, compared with 8-bit, is large enough to affect agentic or programming tasks. The user also asks which inference backends, such as vLLM or SGLang, others are currently using on this hardware configuration.
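The memory question in the post comes down to simple arithmetic, sketched below in Python. The 96GB per-GPU figure follows from the post's own numbers (384GB across four GPUs). The 1,000-billion parameter count is a purely hypothetical placeholder, not the size of any model named above, and the estimate covers weights only, ignoring KV cache, activations, and runtime overhead.

```python
GPU_VRAM_GB = 96  # per-GPU VRAM implied by the post: 384 GB / 4 GPUs


def weight_footprint_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory in decimal GB (weights only, no KV cache or overhead)."""
    # params_billions * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: int, num_gpus: int) -> bool:
    """True if the weights alone fit in the combined VRAM of num_gpus cards."""
    return weight_footprint_gb(params_billions, bits_per_weight) <= num_gpus * GPU_VRAM_GB


# Hypothetical model size, chosen only for illustration (an assumption, not from the post).
HYPOTHETICAL_PARAMS_B = 1000

if __name__ == "__main__":
    for num_gpus in (4, 8):
        for bits in (4, 8):
            need = weight_footprint_gb(HYPOTHETICAL_PARAMS_B, bits)
            have = num_gpus * GPU_VRAM_GB
            verdict = "fits" if fits(HYPOTHETICAL_PARAMS_B, bits, num_gpus) else "does not fit"
            print(f"{num_gpus} GPUs, {bits}-bit: need {need:.0f} GB, have {have} GB -> {verdict}")
```

Under these assumptions, the hypothetical model fits on eight GPUs at 4-bit but not at 8-bit, which is the kind of trade-off the poster describes. Real deployments need extra headroom for the KV cache, so a weights-only fit is a lower bound.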
Reddit Inquiry on Running Large Models with 4x-8x RTX 6000 PROs