CrossPool is a serving engine for cold Mixture-of-Experts (MoE) models, that is, models that receive only sparse requests. It addresses GPU memory inefficiency by placing FFN weights and KV-cache in separate pools. This disaggregation lets the system consolidate static weights while provisioning KV-cache capacity dynamically as demand arises, avoiding the limitations of monolithic memory allocation.

  • CrossPool separates FFN weights and KV-cache into two GPU memory pools: a weights pool for consolidation and a KV-cache pool for dynamic serving.
  • The system combines a KV-cache planner and virtualizer with a layer-wise pipeline scheduler that hides the latency of hidden-state transfers.
  • It uses persistent kernels with control lowering to reduce CPU-GPU control overhead.
  • CrossPool supports bursty long-context requests and reduces P99 time-between-tokens (TBT) by up to 10.4x compared to state-of-the-art kvcached-based multi-LLM serving systems.
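
The two-pool split described above can be illustrated with a toy accounting model. This is only a sketch under stated assumptions, not CrossPool's implementation or API: the class names `WeightsPool` and `KVCachePool`, the block-based KV accounting, and all sizes are hypothetical and chosen for illustration. It shows the general idea that weights are packed once into a static pool, while KV-cache blocks are borrowed and returned per request from a pool shared across models.

```python
from dataclasses import dataclass, field


@dataclass
class WeightsPool:
    """Hypothetical static pool: weights of several cold models are packed together."""
    capacity_gb: float
    resident: dict = field(default_factory=dict)  # model name -> size in GB

    def used_gb(self) -> float:
        return sum(self.resident.values())

    def load(self, model: str, size_gb: float) -> None:
        # Weights are static, so a model is placed once and stays resident.
        if self.used_gb() + size_gb > self.capacity_gb:
            raise MemoryError(f"weights pool cannot fit {model}")
        self.resident[model] = size_gb


@dataclass
class KVCachePool:
    """Hypothetical dynamic pool: KV blocks are shared by all models' requests."""
    total_blocks: int
    owned: dict = field(default_factory=dict)  # request id -> block count

    def free_blocks(self) -> int:
        return self.total_blocks - sum(self.owned.values())

    def allocate(self, request_id: str, blocks: int) -> bool:
        # Grant blocks only if the shared pool currently has room.
        if blocks > self.free_blocks():
            return False
        self.owned[request_id] = self.owned.get(request_id, 0) + blocks
        return True

    def release(self, request_id: str) -> int:
        # Return a finished request's blocks to the shared pool.
        return self.owned.pop(request_id, 0)


# Two cold models share one weights pool (sizes are made up).
weights = WeightsPool(capacity_gb=80.0)
weights.load("model_a", 30.0)
weights.load("model_b", 40.0)

# A long-context burst on model_a can borrow most of the shared KV pool.
kv = KVCachePool(total_blocks=1000)
burst_ok = kv.allocate("model_a/req-1", 900)
blocked = kv.allocate("model_b/req-1", 200)  # not enough free blocks yet
kv.release("model_a/req-1")
retry_ok = kv.allocate("model_b/req-1", 200)  # succeeds after the release
```

In this toy model, a burst on one model can use KV capacity that a monolithic per-model allocation would have reserved for other, idle models. The planner, virtualizer, pipeline scheduler, and persistent kernels named in the list are not modeled here.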

By enabling efficient GPU memory pooling, CrossPool improves utilization for cold models with sparse requests and provides stronger support for long-context inference workloads.