A setup using four RTX 5060 Ti GPUs (totaling $1800) achieves about 55 tokens per second with Qwen3.6-27B-FP8, with a 262K-token context length and a bfloat16 KV cache. The configuration uses P2P (peer-to-peer transfers between GPUs) and FlashInfer. Benchmark results show an output throughput of 55.67 tokens per second and a speculative decoding acceptance rate of 65.25%.
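To put the reported figures in perspective, here is a minimal arithmetic sketch. It uses only the numbers quoted above (total cost, GPU count, output throughput). The function names and the 1,000-token example are illustrative assumptions, not part of the original benchmark.

```python
# Figures quoted in the post above.
TOTAL_COST_USD = 1800.0       # four GPUs combined
NUM_GPUS = 4
OUTPUT_TOKENS_PER_S = 55.67   # reported output token throughput


def cost_per_gpu(total_cost: float, num_gpus: int) -> float:
    """Average price paid per GPU."""
    return total_cost / num_gpus


def cost_per_token_per_second(total_cost: float, tokens_per_s: float) -> float:
    """Hardware dollars spent per token/s of sustained output throughput."""
    return total_cost / tokens_per_s


def seconds_to_generate(num_tokens: int, tokens_per_s: float) -> float:
    """Time to emit num_tokens at a steady output rate (ignores prefill time)."""
    return num_tokens / tokens_per_s


if __name__ == "__main__":
    print(f"Cost per GPU: ${cost_per_gpu(TOTAL_COST_USD, NUM_GPUS):.2f}")
    print(
        "Cost per token/s: "
        f"${cost_per_token_per_second(TOTAL_COST_USD, OUTPUT_TOKENS_PER_S):.2f}"
    )
    # Hypothetical workload: a 1,000-token response.
    print(
        "Time for 1,000 output tokens: "
        f"{seconds_to_generate(1000, OUTPUT_TOKENS_PER_S):.1f} s"
    )
```

These estimates assume the throughput stays constant; real generation speed typically varies with context length and batch size.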