arXiv cs.LG · 8d ago · research

SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs


SoftMoE replaces discrete top-k routing with a differentiable soft top-k relaxation (LapSum), so that expert selection can be optimized directly by gradient descent. It learns to allocate expert activations non-uniformly across layers, with later layers activating more experts, while using significantly fewer experts overall than traditional sparse MoE.
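To make the idea of a soft top-k concrete, here is a minimal sketch in plain Python, not the paper's implementation. It assumes the usual LapSum formulation: each router score gets a weight from a Laplace CDF centered on a shared threshold, and the threshold is chosen so that the weights sum to k. The function names, the `alpha` temperature, and the bisection solver are illustrative assumptions; only the forward computation is shown.

```python
import math


def laplace_cdf(z):
    """CDF of the standard Laplace distribution (numerically safe for large |z|)."""
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)


def soft_top_k(scores, k, alpha=0.1, iters=60):
    """Return soft selection weights in (0, 1) that sum to approximately k.

    Hypothetical helper: finds a threshold b by bisection such that
    sum_i F((s_i - b) / alpha) = k, where F is the Laplace CDF.
    Smaller alpha pushes the weights toward a hard 0/1 top-k mask.
    """
    if not 0 < k < len(scores):
        raise ValueError("k must satisfy 0 < k < len(scores)")

    def total(b):
        return sum(laplace_cdf((s - b) / alpha) for s in scores)

    # total(b) decreases monotonically in b, so bracket the root generously.
    lo = min(scores) - 50.0 * alpha
    hi = max(scores) + 50.0 * alpha
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if total(mid) > k:
            lo = mid  # threshold too low: too much mass selected
        else:
            hi = mid  # threshold too high: too little mass selected
    b = 0.5 * (lo + hi)
    return [laplace_cdf((s - b) / alpha) for s in scores]


# Four hypothetical router scores for one token, softly selecting 2 experts.
weights = soft_top_k([2.0, 0.5, 1.5, -1.0], k=2)
```

Because every weight is a smooth function of the scores, the same computation written in an autodiff framework would pass gradients back to the router, which a hard top-k mask does not. A learnable, non-integer k per layer is one way such a relaxation could support the non-uniform allocation described above, though that detail is an assumption here.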

