CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

CrossPool is a serving engine for cold Mixture-of-Experts (MoE) models. It disaggregates feed-forward network (FFN) weights and KV-cache into separate GPU memory pools to address the memory inefficiency that arises when requests are sparse. By consolidating the static weights and provisioning KV-cache capacity dynamically to match active demand, the system aims to improve GPU memory utilization and support bursty long-context requests.

Separates FFN weights and KV-cache into distinct GPU memory pools: a weights pool for consolidated storage and a KV-cache pool for dynamic serving.
Uses a KV-cache planner and virtualizer, together with a layer-wise pipeline scheduler that hides the latency of hidden-state transfers.
Employs persistent kernels with control lowering to reduce CPU-GPU control overhead.
Outperforms state-of-the-art kvcached-based multi-LLM serving systems, reducing P99 time between tokens (TBT) by up to 10.4x.
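
To illustrate what hiding transfer latency means for a pipeline scheduler, the following is a toy timing model, not CrossPool's scheduler. It assumes a stream of independent work items (for example, layer steps taken from different micro-batches), each with a compute cost and a hidden-state transfer cost. The function names and the millisecond figures are hypothetical and chosen only for illustration.

```python
def serial_time(compute_ms, transfer_ms):
    """Total time when every transfer blocks the next compute step."""
    return sum(c + t for c, t in zip(compute_ms, transfer_ms))


def pipelined_time(compute_ms, transfer_ms):
    """Total time for a two-stage pipeline (compute stage, transfer stage).

    Assumes the work items are independent, so the compute stage can
    start the next item while the previous item's output is in flight.
    """
    compute_done = 0.0   # when the compute stage finishes the current item
    transfer_done = 0.0  # when the transfer stage finishes the current item
    for c, t in zip(compute_ms, transfer_ms):
        compute_done += c
        # A transfer starts once its data is ready and the link is free.
        transfer_done = max(transfer_done, compute_done) + t
    return transfer_done


# Hypothetical costs: four work items, 2 ms compute and 1 ms transfer each.
compute = [2.0, 2.0, 2.0, 2.0]
transfer = [1.0, 1.0, 1.0, 1.0]
serial = serial_time(compute, transfer)
pipelined = pipelined_time(compute, transfer)
```

In this model, whenever transfers are no slower than compute, only the final transfer remains on the critical path; the others are overlapped with compute of later items.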

Because the KV-cache pool is provisioned for the aggregate active demand across models rather than for a worst-case reservation per model, the system can absorb bursty workloads more effectively.
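
The difference between reserving worst-case capacity per model and serving aggregate active demand from one pool can be sketched with a toy block allocator. This is a minimal sketch under stated assumptions, not CrossPool's planner or virtualizer: the `SharedKVPool` class, the model names, and the block counts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SharedKVPool:
    """Toy KV-cache block pool shared by several models (hypothetical)."""
    total_blocks: int
    used: dict = field(default_factory=dict)  # model name -> blocks held

    def free_blocks(self) -> int:
        return self.total_blocks - sum(self.used.values())

    def allocate(self, model: str, blocks: int) -> bool:
        """Grant blocks only if the pool as a whole has room."""
        if blocks > self.free_blocks():
            return False
        self.used[model] = self.used.get(model, 0) + blocks
        return True

    def release(self, model: str, blocks: int) -> None:
        held = self.used.get(model, 0)
        if blocks > held:
            raise ValueError("releasing more blocks than the model holds")
        self.used[model] = held - blocks


def static_reservation(worst_case_blocks: dict) -> int:
    """Blocks needed if every model reserves its own worst case."""
    return sum(worst_case_blocks.values())


# Hypothetical workload: three cold models, each with a worst case of 400 blocks.
worst_case = {"moe-a": 400, "moe-b": 400, "moe-c": 400}
reserved = static_reservation(worst_case)

# A shared pool sized for aggregate active demand can be much smaller.
pool = SharedKVPool(total_blocks=500)
burst_ok = pool.allocate("moe-a", 400)     # one model bursts to its worst case
idle_ok = pool.allocate("moe-b", 50)       # another model is lightly loaded
overflow_ok = pool.allocate("moe-c", 100)  # rejected: only 50 blocks remain
```

The trade-off visible here is that a shared pool must reject or defer requests when several models burst at once, which is presumably where a planner that tracks aggregate demand comes in.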