The article introduces ELDR, an expert-locality-aware decode router designed to reduce decode latency in prefill-decode disaggregated serving of mixture-of-experts (MoE) models. Unlike existing routers, which only balance load, ELDR predicts which experts a request will activate during generation from its prefill activations and routes the request to a decode worker with a matching signature.
- ELDR builds an expert signature from prefill activations to predict generation-phase experts.
- Offline balanced K-means partitions signature space across decode workers for routing decisions.
- Online locality-band routing directs requests to the least-loaded worker among those best matching the signature.
- A signature cache co-indexed with the KV cache maintains exact signatures under prefix caching.
- Evaluated on up to 40 GPUs in vLLM, ELDR reduces median time per output token (TPOT) by 5.9-13.9% across three MoE models and two workloads.
This approach addresses the latency disparities that arise among equally loaded workers when their requests activate different expert weights, offering a more effective routing strategy for MoE deployments than load balancing alone.
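
To make the online path concrete, here is a minimal sketch of signature construction and locality-band routing. It is an illustration under stated assumptions, not the article's implementation: the signature is assumed to be a normalized histogram of expert IDs activated during prefill, similarity is assumed to be squared Euclidean distance to per-worker centroids (taken as given from the offline clustering step), and the "band" is assumed to be an additive tolerance on that distance. The names `expert_signature`, `route`, and `band` are hypothetical.

```python
from collections import Counter


def expert_signature(prefill_expert_ids, num_experts):
    """Assumed signature: normalized histogram of experts activated in prefill."""
    counts = Counter(prefill_expert_ids)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts.get(e, 0) / total for e in range(num_experts)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def route(signature, centroids, loads, band=0.1):
    """Pick the least-loaded worker among those within `band` of the best match.

    centroids[w] is the (assumed) cluster centroid assigned to decode worker w,
    and loads[w] is that worker's current load, e.g. running request count.
    """
    dists = [squared_distance(signature, c) for c in centroids]
    best = min(dists)
    # The locality band: every worker whose match is nearly as good as the best.
    candidates = [w for w, d in enumerate(dists) if d <= best + band]
    # Break load ties in favor of the closer centroid.
    return min(candidates, key=lambda w: (loads[w], dists[w]))


# Usage: worker 0 matches exactly but is busy; worker 1 is close and lightly loaded.
sig = expert_signature([0, 0, 1, 3], num_experts=4)
centroids = [[0.5, 0.25, 0.0, 0.25], [0.4, 0.3, 0.1, 0.2], [0.0, 0.0, 1.0, 0.0]]
loads = [10, 2, 0]
chosen = route(sig, centroids, loads, band=0.1)  # worker 1
```

A band of zero degenerates to pure nearest-centroid routing, while a very wide band degenerates to pure least-loaded routing, so the band width is the knob that trades expert locality against load balance in this sketch.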