The article details the performance of Tesla V100-SXM2-16GB modules for running local large language models, highlighting their high HBM2 bandwidth as a key asset for inference despite lacking bf16 or int8 tensor ops.
- A single module runs Gemma 4 26B entirely on-GPU, achieving 99.8 tok/s in TCC mode compared to 56.8 tok/s in WSL2/MCDM.
- Dual modules provide 32GB of combined VRAM and roughly double the aggregate memory bandwidth, allowing Qwen3.6-35B to run fully resident with tensor splitting.
- Under concurrent multi-agent loads with short prompts, aggregate throughput scales from 62.7 tok/s (1 agent) to 338.1 tok/s (16 agents).
- With realistic ~24k-token system prompts, aggregate throughput caps around 150-175 tok/s for 8-16 concurrent agents.
- Driver support is limited to versions R570 through R580, as Volta support ends in CUDA 13.3/R595.
- Dual setups need attention to PSU transient response, since load spikes can otherwise cause hard reboots.
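
To make the scaling figures above easier to compare, the following minimal sketch derives speedup, per-agent rate, and scaling efficiency from the quoted numbers. The function and constant names are illustrative and not from the article. The long-prompt per-agent range is only a bound, since the summary does not say which agent count produced which end of the 150-175 tok/s range.

```python
# Throughput figures quoted in the summary (tokens per second).
SINGLE_AGENT_TPS = 62.7     # 1 agent, short prompts
SIXTEEN_AGENT_TPS = 338.1   # 16 agents, short prompts, aggregate
TCC_TPS = 99.8              # single module, TCC mode
WSL2_TPS = 56.8             # single module, WSL2/MCDM


def scaling_summary(base_tps: float, aggregate_tps: float, agents: int) -> dict:
    """Return speedup, per-agent rate, and scaling efficiency for a concurrent run."""
    speedup = aggregate_tps / base_tps
    return {
        "speedup": speedup,                        # aggregate gain over one agent
        "per_agent_tps": aggregate_tps / agents,   # what each agent actually sees
        "efficiency": speedup / agents,            # 1.0 would be perfect linear scaling
    }


short_prompts = scaling_summary(SINGLE_AGENT_TPS, SIXTEEN_AGENT_TPS, 16)
tcc_advantage = TCC_TPS / WSL2_TPS

# Bounds on per-agent rate with ~24k-token system prompts (150-175 tok/s
# aggregate across 8-16 agents); the exact pairing is not given in the summary.
long_prompt_per_agent_min = 150 / 16
long_prompt_per_agent_max = 175 / 8

print(f"16-agent speedup:     {short_prompts['speedup']:.2f}x")
print(f"per-agent throughput: {short_prompts['per_agent_tps']:.1f} tok/s")
print(f"scaling efficiency:   {short_prompts['efficiency']:.0%}")
print(f"TCC vs WSL2/MCDM:     {tcc_advantage:.2f}x")
print(f"long-prompt per-agent bounds: "
      f"{long_prompt_per_agent_min:.1f} to {long_prompt_per_agent_max:.1f} tok/s")
```

Read this way, 16 short-prompt agents deliver about 5.4 times the single-agent aggregate, but each agent sees roughly a third of the single-agent rate.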
The author notes that Q4 quantization holds up for many tasks but is a weak spot for long agent chains. Users can trade concurrency for quality by switching to Q6_K weights, provided the model still fits in the 32GB of dual-module capacity.
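
The Q4 versus Q6_K trade-off can be roughed out with a back-of-the-envelope estimate of weight size. This is a sketch under stated assumptions, not a measurement from the article: it assumes the nominal block sizes of the GGUF k-quant formats (4.5 bits per weight for Q4_K, 6.5625 for Q6_K), a nominal 35 billion parameters taken from the Qwen3.6-35B model name, and 32GB read as 32 × 10^9 bytes. Real files mix tensor types, so actual sizes will differ somewhat.

```python
# Assumed bits per weight for two GGUF k-quant block formats.
# Q4_K: 144 bytes per 256 weights; Q6_K: 210 bytes per 256 weights.
BITS_PER_WEIGHT = {"Q4_K": 4.5, "Q6_K": 6.5625}


def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB (10^9 bytes) for a given quant type."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def headroom_gb(params_billion: float, quant: str, vram_gb: float = 32.0) -> float:
    """VRAM left over after weights, e.g. for KV cache and compute buffers."""
    return vram_gb - weight_gb(params_billion, quant)


PARAMS_BILLION = 35.0  # assumption: nominal size implied by the model name

for quant in ("Q4_K", "Q6_K"):
    print(f"{quant}: weights ~{weight_gb(PARAMS_BILLION, quant):.1f} GB, "
          f"headroom ~{headroom_gb(PARAMS_BILLION, quant):.1f} GB of 32 GB")
```

Under these assumptions Q6_K leaves only a few GB for KV cache, against roughly 12 GB at Q4_K. That is the sense in which higher-precision weights are paid for with fewer concurrent agents or shorter contexts.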