CrossPool is a serving engine for cold Mixture-of-Experts (MoE) models that addresses GPU memory inefficiency by placing FFN weights and KV-cache in separate pools. This disaggregation lets the system consolidate static weights while provisioning KV-cache dynamically as active requests demand it, avoiding the limitations of monolithic memory allocation.
- CrossPool separates FFN weights and KV-cache into two GPU memory pools: a weights pool for consolidation and a KV-cache pool for dynamic serving.
- The system pairs a KV-cache planner and virtualizer with a layer-wise pipeline scheduler that masks the latency of hidden-state transfers.
- Persistent kernels with control lowering reduce CPU-GPU control overhead.
- CrossPool supports bursty long-context requests and reduces P99 time between tokens (TBT) by up to 10.4x compared to state-of-the-art kvcached-based multi-LLM serving systems.
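
To make the idea of a shared KV-cache pool concrete, the following is a minimal sketch of on-demand block allocation across several models. It is not CrossPool's planner or virtualizer: the class `KVCachePool`, its methods, the 16-token block size, and the model and request names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class KVCachePool:
    """Toy shared KV-cache pool (hypothetical): fixed-size blocks handed out on demand."""
    total_blocks: int
    block_tokens: int = 16  # assumed block size in tokens, illustrative only
    free_blocks: list = field(default_factory=list)
    owned: dict = field(default_factory=dict)  # (model, request) -> list of block ids

    def __post_init__(self):
        self.free_blocks = list(range(self.total_blocks))

    def blocks_needed(self, num_tokens):
        # Ceiling division: blocks required to hold num_tokens.
        return -(-num_tokens // self.block_tokens)

    def grow(self, model, request, num_tokens):
        """Give the request enough blocks for num_tokens; return False if the pool is exhausted."""
        held = self.owned.setdefault((model, request), [])
        extra = self.blocks_needed(num_tokens) - len(held)
        if extra > len(self.free_blocks):
            return False
        for _ in range(max(extra, 0)):
            held.append(self.free_blocks.pop())
        return True

    def release(self, model, request):
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.owned.pop((model, request), []))


pool = KVCachePool(total_blocks=8)
first = pool.grow("model_a", "r1", 40)    # 3 blocks for a 40-token context
second = pool.grow("model_b", "r1", 100)  # needs 7 blocks, only 5 are free
pool.release("model_a", "r1")             # model_a goes idle, blocks return to the pool
third = pool.grow("model_b", "r1", 100)   # the burst now fits
```

Because no model reserves KV-cache space up front, an idle model holds zero blocks and a bursty long-context request on another model can use the whole pool, which is the behavior the disaggregated design aims for.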
By enabling efficient GPU memory pooling, CrossPool improves utilization for cold models with sparse requests and provides stronger support for long-context inference workloads.
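
The benefit of hiding hidden-state transfers behind computation can be illustrated with a simple two-stage pipeline timing model. This is a sketch under strong assumptions rather than CrossPool's scheduler: work units are treated as independent (for example, layers of different interleaved micro-batches), each needs one compute step followed by one transfer, and the functions `serial_time` and `pipelined_time` and all timings are hypothetical.

```python
def serial_time(n, compute, transfer):
    """Every unit computes, then transfers, with no overlap."""
    return n * (compute + transfer)


def pipelined_time(n, compute, transfer):
    """Transfers run on a separate engine and overlap with the next unit's compute."""
    compute_done = 0.0
    transfer_done = 0.0
    for _ in range(n):
        compute_done += compute
        # A transfer starts once its compute has finished and the copy engine is free.
        transfer_done = max(transfer_done, compute_done) + transfer
    return transfer_done


# Illustrative numbers only: 4 units, 3 ms of compute and 1 ms of transfer each.
serial = serial_time(4, 3.0, 1.0)
overlapped = pipelined_time(4, 3.0, 1.0)
```

In this model, when the transfer is shorter than the compute step, only the final transfer remains on the critical path; when the transfer is longer, it becomes the bottleneck and overlap helps less.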