MoE models such as GLM 5.2 and DeepSeek V4 show that the top 20% of experts handle 85% of activations. A multi-tier caching approach could move these hot experts into GPU memory, leveraging high-bandwidth VRAM for faster inference, while less frequently used experts stay in slower tiers. Existing systems such as PowerInfer, Lidenburg's llama.cpp, and HOBBIT demonstrate practical implementations of expert caching and prefetching.
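The placement idea can be illustrated with a minimal sketch. It assumes per-expert activation counts are already available from profiling, and it assumes three tiers ("gpu", "cpu", "disk"); the tier names, the `assign_tiers` and `gpu_hit_rate` helpers, and the counts are hypothetical and are not taken from PowerInfer, llama.cpp, or HOBBIT. The synthetic counts are chosen only to mirror the 20% / 85% skew described above.

```python
def assign_tiers(activation_counts, gpu_slots, cpu_slots):
    """Rank experts by activation count and fill tiers hottest-first.

    activation_counts: dict mapping expert id -> number of activations.
    gpu_slots / cpu_slots: how many experts fit in each tier (assumed known).
    Experts that fit in neither tier are left on "disk".
    """
    # Sort by descending count; break ties by expert id for determinism.
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    placement = {}
    for rank, expert in enumerate(ranked):
        if rank < gpu_slots:
            placement[expert] = "gpu"
        elif rank < gpu_slots + cpu_slots:
            placement[expert] = "cpu"
        else:
            placement[expert] = "disk"
    return placement


def gpu_hit_rate(activation_counts, placement):
    """Fraction of all activations served by experts resident on the GPU."""
    total = sum(activation_counts.values())
    if total == 0:
        return 0.0
    hits = sum(c for e, c in activation_counts.items() if placement[e] == "gpu")
    return hits / total


# Synthetic profile: 10 experts, where 2 of them (20%) receive 850 of 1000
# activations (85%). These numbers are illustrative, not measured.
counts = {0: 450, 1: 400, 2: 25, 3: 25, 4: 20, 5: 20, 6: 20, 7: 15, 8: 15, 9: 10}

placement = assign_tiers(counts, gpu_slots=2, cpu_slots=3)
rate = gpu_hit_rate(counts, placement)
print(f"GPU hit rate: {rate:.2f}")
```

With only two GPU slots, the sketch keeps experts 0 and 1 in VRAM and serves most activations from there. A real system would also need to handle per-layer expert sets, changing access patterns, and prefetching, which this static ranking does not attempt.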
Multi-Tier MoE Caching: Optimizing Expert Activation in Large Models