CrossPool is a serving engine for cold Mixture-of-Experts (MoE) models, that is, models that receive only sparse requests. It disaggregates FFN weights and KV-cache into separate GPU memory pools to address the memory inefficiency of serving such models. By consolidating the static weights and provisioning KV-cache only for active demand, the system aims to improve GPU memory utilization and support bursty long-context requests.
- Separates FFN weights and KV-cache into distinct GPU memory pools: a weights pool for consolidated storage and a KV-cache pool for dynamic serving.
- Uses a KV-cache planner and virtualizer, together with a layer-wise pipeline scheduler that masks the latency of hidden-state transfers.
- Employs persistent kernels with control lowering to reduce CPU-GPU control overhead.
- Outperforms state-of-the-art kvcached-based multi-LLM serving systems, reducing P99 time-between-tokens (TBT) by up to 10.4x.
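
To make the idea of hiding hidden-state transfers concrete, here is a toy timing model of a generic two-stage pipeline. It is a sketch under stated assumptions, not CrossPool's scheduler: the function names, the notion of uniform independent "work units", and the millisecond figures are all hypothetical and chosen only for illustration.

```python
def serial_time(n_units: int, compute_ms: float, transfer_ms: float) -> float:
    """Total time when every unit computes, then transfers, with no overlap."""
    return n_units * (compute_ms + transfer_ms)


def pipelined_time(n_units: int, compute_ms: float, transfer_ms: float) -> float:
    """Total time for a two-stage pipeline over independent units.

    Assumption: the transfer of one unit can run while the next unit
    computes, and all units have identical stage times. The first unit
    pays both stages in full; every later unit adds only the slower stage.
    """
    if n_units <= 0:
        return 0.0
    return compute_ms + transfer_ms + (n_units - 1) * max(compute_ms, transfer_ms)


# Hypothetical numbers: 8 units, 1.0 ms compute, 0.25 ms transfer each.
no_overlap = serial_time(8, 1.0, 0.25)
with_overlap = pipelined_time(8, 1.0, 0.25)
```

In this model, when transfers are shorter than compute, all but the last transfer disappears from the critical path, which is the sense in which a pipeline "hides" them.
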
Because KV-cache capacity is provisioned for the aggregate active demand across models, rather than reserved for each model's worst case, memory that idle models would otherwise hold is available to whichever model is currently experiencing a burst.
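
The difference between per-model worst-case reservation and pooled active demand can be shown with simple arithmetic. The sketch below is illustrative only: the `ModelDemand` type, the model names, and the GiB figures are hypothetical and are not taken from CrossPool or its evaluation.

```python
from dataclasses import dataclass


@dataclass
class ModelDemand:
    name: str                 # hypothetical model label
    worst_case_kv_gib: float  # KV-cache needed at the model's peak load
    active_kv_gib: float      # KV-cache needed by requests in flight now


def per_model_reservation(models: list[ModelDemand]) -> float:
    """Memory required if every model reserves its own worst case."""
    return sum(m.worst_case_kv_gib for m in models)


def pooled_demand(models: list[ModelDemand]) -> float:
    """Memory required if one shared pool covers only active demand."""
    return sum(m.active_kv_gib for m in models)


def fits(required_gib: float, pool_gib: float) -> bool:
    """True when the required KV-cache fits in the available pool."""
    return required_gib <= pool_gib


# Hypothetical snapshot: two mostly idle models and one long-context burst.
models = [
    ModelDemand("model-a", worst_case_kv_gib=40.0, active_kv_gib=2.0),
    ModelDemand("model-b", worst_case_kv_gib=40.0, active_kv_gib=0.0),
    ModelDemand("model-c", worst_case_kv_gib=40.0, active_kv_gib=30.0),
]
reserved = per_model_reservation(models)  # 120.0 GiB
pooled = pooled_demand(models)            # 32.0 GiB
```

With a hypothetical 64 GiB KV-cache pool, `fits(pooled, 64.0)` holds while `fits(reserved, 64.0)` does not: the same hardware serves the burst only when capacity follows active demand.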