ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

The article introduces ELDR, an expert-locality-aware decode router designed to reduce latency in prefill-decode (PD) disaggregated serving of mixture-of-experts (MoE) models. Unlike existing routers that only balance load, ELDR predicts which experts a request will activate during generation from its prefill activations and routes the request to a decode worker with a matching signature.
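The summary does not say how the signature is computed, so the following is only a minimal sketch under one plausible assumption: the signature is the normalized frequency with which each expert was selected across the prefill tokens. The names `expert_signature`, `prefill_expert_ids`, and `num_experts` are hypothetical and are not taken from ELDR or vLLM.

```python
from typing import Iterable, List


def expert_signature(
    prefill_expert_ids: Iterable[Iterable[int]], num_experts: int
) -> List[float]:
    """Return a normalized expert-activation histogram (an assumed signature form).

    prefill_expert_ids: for each prefill token, the expert ids chosen by the
    gate (for example its top-k choices), flattened across layers if desired.
    """
    counts = [0] * num_experts
    total = 0
    for token_experts in prefill_expert_ids:
        for expert_id in token_experts:
            counts[expert_id] += 1
            total += 1
    if total == 0:
        # No activations recorded: return an all-zero signature.
        return [0.0] * num_experts
    return [c / total for c in counts]


# Two prefill tokens, each routed to two of four experts.
signature = expert_signature([[0, 1], [1, 2]], num_experts=4)
```

A fixed-length vector like this is convenient because it can be compared against per-worker reference vectors with an ordinary distance function.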

ELDR builds an expert signature from prefill activations to predict generation-phase experts.
Offline balanced K-means partitions signature space across decode workers for routing decisions.
Online locality-band routing directs requests to the least-loaded worker among those best matching the signature.
A signature cache co-indexed with the KV cache maintains exact signatures under prefix caching.
Evaluated on up to 40 GPUs in vLLM, ELDR reduces median time per output token (TPOT) by 5.9% to 13.9% across three MoE models and two workloads.
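The online routing step described above can be sketched as follows. This is an illustration under stated assumptions, not ELDR's implementation: each decode worker is assumed to be represented by one centroid from the offline clustering, similarity is assumed to be squared Euclidean distance, and the "locality band" is assumed to be an absolute slack `band` around the best distance. The names `route`, `centroids`, `loads`, and `band` are hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def route(
    signature: Sequence[float],
    centroids: List[Sequence[float]],
    loads: Sequence[int],
    band: float = 0.1,
) -> int:
    """Pick the least-loaded worker among those whose centroid is close to the signature.

    Assumption: a worker is "in the band" if its distance is within `band`
    of the best (smallest) distance. Ties on load go to the closer centroid.
    """
    dists = [squared_distance(signature, c) for c in centroids]
    cutoff = min(dists) + band
    candidates = [i for i, d in enumerate(dists) if d <= cutoff]
    return min(candidates, key=lambda i: (loads[i], dists[i]))


# Three workers; worker 1 is nearly as good a match as worker 0 but less loaded.
centroids = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
loads = [5, 2, 0]
chosen = route([1.0, 0.0], centroids, loads, band=0.05)  # worker 1
strict = route([1.0, 0.0], centroids, loads, band=0.0)   # worker 0
```

Under these assumptions the band width is the knob that trades locality against load balance: a band of zero always picks the closest centroid, while a very wide band degenerates into plain least-loaded routing.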

This approach addresses the latency disparities that arise between equally loaded workers when their requests activate different expert weights, offering a more effective routing strategy than load balancing alone for MoE deployments.