A cloud systems engineer reports that connecting four GPUs through a single PCIe x16 card bifurcated into four x4 links creates a bandwidth choke point for peer-to-peer (P2P) communication. P2P traffic saturates the fabric connecting the cards, and performance ends up worse than with P2P disabled.
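For a sense of scale, here is a rough sketch of the link arithmetic. It assumes PCIe Gen 4 and an even four-way split of the x16 slot. The post does not state the slot generation, so treat the generation and the helper name `link_gb_per_s` as illustrative assumptions. The figures are theoretical per-direction ceilings after line encoding, not measured throughput.

```python
# Approximate per-direction PCIe bandwidth after line encoding.
# Assumption: Gen 4 slot split evenly into four x4 links (not stated in the post).
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}  # raw transfer rate per lane, in GT/s
ENCODING = 128 / 130                    # 128b/130b encoding used by Gen 3 to Gen 5


def link_gb_per_s(gen: int, lanes: int) -> float:
    """Return the theoretical one-direction bandwidth of a link in GB/s."""
    return GT_PER_S[gen] * ENCODING / 8 * lanes


full_slot = link_gb_per_s(4, 16)  # the whole x16 slot
per_gpu = link_gb_per_s(4, 4)     # one bifurcated x4 link

print(f"Gen 4 x16 slot:   {full_slot:.1f} GB/s")
print(f"Gen 4 x4 per GPU: {per_gpu:.1f} GB/s")
```

Under these assumptions each GPU gets roughly a quarter of the slot's bandwidth, which is the budget that all-to-all tensor-parallel traffic has to fit into.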
- The author finds that tensor parallelism across all four GPUs (TP=4) performs worse with P2P enabled than with P2P disabled, because the bridge saturates.
- Potential solutions include disabling P2P for a 10-15% gain, using Chinese SlimSAS bifurcation bridges ($150-$250), or purchasing specific Gen 4 PCIe bridges from Cpayne ($1200).
- An alternative configuration is pipeline parallelism instead of tensor parallelism, but it outperforms TP=4 with P2P off only at high concurrency.
- Other options include used PLX switches from eBay, which may carry firmware restrictions, or motherboards with dedicated x16 lanes, which require expensive retimer bifurcation cards ($130+ each).
The findings suggest that the cost and complexity of resolving the bifurcation bottleneck often outweigh the modest performance gains from P2P, making disabling it a practical choice for many setups.
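As a minimal sketch of the "disable P2P" option, the snippet below builds a process environment with NCCL's `NCCL_P2P_DISABLE` variable set. This assumes the inference stack uses NCCL for its collectives, which the post does not state. The helper name `nccl_env` is hypothetical.

```python
import os


def nccl_env(disable_p2p: bool, base=None) -> dict:
    """Return a copy of the environment with NCCL P2P toggled.

    Assumption: the workload uses NCCL, which reads NCCL_P2P_DISABLE at startup.
    """
    env = dict(os.environ if base is None else base)
    if disable_p2p:
        env["NCCL_P2P_DISABLE"] = "1"       # route GPU-to-GPU traffic without P2P
    else:
        env.pop("NCCL_P2P_DISABLE", None)   # fall back to NCCL's default behavior
    return env


env = nccl_env(disable_p2p=True, base={})
print(env)
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching the serving process, so the same launch command can be benchmarked with P2P on and off.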