A Reddit user reports running the Qwen3.6 27B model in Q4 quantization using RAM offload on an RTX 3060 with 12GB VRAM, noting a DRAM bandwidth of only around 30GB/s during inference.
- The user measured 3.12 tokens per second with an 18K-token context and asks whether the bottleneck is LM Studio's implementation or their CPU hardware.
- A test with a smaller prompt, 6 CPU threads, a Q8 KV cache, and 37 layers offloaded to the GPU raised throughput to 4.95 tokens per second, with DRAM bandwidth staying at 30-35GB/s.
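
One way to approach the user's question is a rough bandwidth ceiling: for a dense model, each generated token has to read every CPU-resident weight from DRAM about once, so tokens per second cannot exceed the DRAM bandwidth divided by the bytes held in system RAM. The sketch below is a back-of-envelope estimate, not a measurement. The function name, the 4.5 bits-per-weight figure for Q4, and the total layer count of 64 are illustrative assumptions that do not come from the post; it also assumes equally sized layers and ignores GPU compute time, KV cache reads, and PCIe transfers.

```python
def cpu_bound_tokens_per_second(model_gb, total_layers, gpu_layers, dram_gb_per_s):
    """Rough upper bound on decode speed when CPU-side layers are DRAM-bound.

    Assumes a dense model with equally sized layers, where every weight
    kept in system RAM is read once per generated token.
    """
    if not 0 <= gpu_layers <= total_layers:
        raise ValueError("gpu_layers must be between 0 and total_layers")
    # Fraction of the weights that stay in system RAM.
    cpu_fraction = (total_layers - gpu_layers) / total_layers
    cpu_gb_per_token = model_gb * cpu_fraction
    if cpu_gb_per_token == 0:
        return float("inf")  # nothing left on the CPU side to bound the speed
    return dram_gb_per_s / cpu_gb_per_token


# Hypothetical inputs: 27B parameters at an assumed ~4.5 bits per weight,
# an assumed 64 layers in total, 37 of them on the GPU (from the post),
# and the ~30 GB/s DRAM bandwidth the user observed.
model_gb = 27e9 * 4.5 / 8 / 1e9
ceiling = cpu_bound_tokens_per_second(model_gb, total_layers=64,
                                      gpu_layers=37, dram_gb_per_s=30.0)
print(f"Estimated ceiling: {ceiling:.2f} tokens/s")
```

If the measured throughput sits close to such a ceiling, that would point toward memory bandwidth rather than the inference software as the limit, though the result is only as good as the assumed model size and layer count.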