A user reports severe performance problems with two AMD R9700 GPUs. vLLM fails to run with tensor parallelism (tp=2) because of NCCL errors, and single-card inference throughput is extremely low: 30 tokens per second (tps) for Qwen 0.6B and only 5 tps for a 27B INT4 AWQ model. The user reports that ROCm is installed correctly and the system is properly configured.
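To make throughput figures like the ones above comparable between runs, it can help to compute them the same way each time. The following is a minimal sketch, not the user's actual benchmark: `throughput`, `measure`, and `fake_generate` are hypothetical names introduced here, and `fake_generate` is only a stand-in for whatever call produces output tokens on a real setup. It uses no vLLM API.

```python
import time
from typing import Callable, Sequence


def throughput(num_tokens: int, elapsed_s: float) -> float:
    """Return generated tokens per second for one run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_tokens / elapsed_s


def measure(generate: Callable[[str], Sequence[int]], prompt: str) -> float:
    """Time a single call to `generate` and return its tokens per second.

    `generate` is an assumed callable that returns the generated token ids.
    """
    start = time.perf_counter()
    tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return throughput(len(tokens), elapsed)


def fake_generate(prompt: str) -> Sequence[int]:
    """Hypothetical stand-in for a real inference call; returns 64 dummy tokens."""
    time.sleep(0.01)  # simulate some generation latency
    return list(range(64))


if __name__ == "__main__":
    print(f"{measure(fake_generate, 'hello'):.1f} tokens/s")
```

For example, 150 generated tokens in 5 seconds gives `throughput(150, 5.0) == 30.0`, which matches the 30 tps reported for the small model. Measuring only generated tokens (not prompt tokens) keeps the number comparable across prompts of different lengths.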