A user currently running dual RTX 3090s is considering adding a third card to address VRAM limitations that restrict the number of concurrent requests at 256k context length. The proposed setup places the third GPU in pipeline parallel with the existing two, to increase memory capacity without introducing an inter-GPU bandwidth bottleneck.
- The current setup uses dual RTX 3090s, providing 48 GB of VRAM in total.
- Single-stream performance is already maximized at over 140 tokens per second (TPS) on standard benchmarks.
- The user hits out-of-memory (OOM) errors when attempting more than two concurrent requests, due to KV cache constraints.
- The plan involves connecting a third GPU via PCIe 4.0 in a pipeline parallel configuration.
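To make the KV cache constraint concrete, here is a rough back-of-the-envelope sketch of how per-request KV cache size limits concurrency. The post does not name the model or the serving stack, so every number below (layer count, KV heads, head dimension, 8-bit KV cache, weight size, overhead) is a hypothetical placeholder chosen for illustration, not the user's actual configuration. The formula itself is the standard estimate for attention KV storage: keys plus values, per layer, per KV head, per token.

```python
def kv_cache_bytes_per_request(num_layers, num_kv_heads, head_dim,
                               context_len, bytes_per_elem=2):
    """Estimate KV cache size for one request at full context length.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem


def max_concurrent_requests(total_vram_gib, weights_gib, overhead_gib,
                            per_request_bytes):
    """How many full-context requests fit in the VRAM left after weights."""
    free_bytes = (total_vram_gib - weights_gib - overhead_gib) * 1024**3
    return max(0, int(free_bytes // per_request_bytes))


# HYPOTHETICAL model shape and memory budget (assumptions, not from the post).
per_request = kv_cache_bytes_per_request(
    num_layers=48,        # assumed
    num_kv_heads=4,       # assumed (grouped-query attention)
    head_dim=128,         # assumed
    context_len=262144,   # 256k tokens, as in the post
    bytes_per_elem=1,     # assumed 8-bit KV cache
)

for total_vram in (48, 72):  # two cards vs. three cards, 24 per card
    fits = max_concurrent_requests(
        total_vram_gib=total_vram,
        weights_gib=20,   # assumed quantized weight footprint
        overhead_gib=2,   # assumed activations and runtime overhead
        per_request_bytes=per_request,
    )
    print(f"{total_vram} GiB total: {per_request / 1024**3:.1f} GiB per request, "
          f"{fits} concurrent full-context requests")
```

Under these assumed numbers, most of the third card's memory would go to KV cache rather than weights, which is why adding capacity can raise the concurrency ceiling even if single-stream speed does not change. Plugging in the real model's shape and the serving framework's actual overhead would be needed to get a trustworthy figure.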
The user is seeking community feedback on whether anyone has tested a similar multi-GPU setup, and what results they saw for single-stream versus concurrent-stream performance.