MoE Models Show Device-Dependent Inference Performance

An empirical study finds that Mixture-of-Experts (MoE) models do not consistently outperform dense models on consumer or edge hardware. On the Apple M2 Pro, OLMoE-1B-7B is only 10% slower than a comparable dense model; on the NVIDIA Jetson Orin Nano, it is 31% slower with 2.1 times higher energy per token, due to memory and KV-cache constraints. The results indicate that the benefits of sparse activation are limited by the memory footprint of the model's total parameters, especially on bandwidth-bound edge devices.
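The footprint argument can be illustrated with a rough back-of-the-envelope sketch. It is not the study's methodology: the parameter counts (about 1B active and 7B total, read off the model name), the 2-byte weights, the 8 GiB device budget, and all names such as `ModelSpec` are assumptions chosen for illustration, and KV-cache and activation memory are ignored.

```python
from dataclasses import dataclass

GIB = 2**30  # bytes per GiB


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical model description; all values are illustrative assumptions."""
    name: str
    total_params: float   # parameters that must be resident in memory
    active_params: float  # parameters actually used per generated token


def weight_footprint_gib(model: ModelSpec, bytes_per_param: float) -> float:
    """Memory needed to hold all weights; depends on TOTAL parameters."""
    return model.total_params * bytes_per_param / GIB


def weight_bytes_read_per_token(model: ModelSpec, bytes_per_param: float) -> float:
    """Weight bytes touched per decoded token; depends on ACTIVE parameters."""
    return model.active_params * bytes_per_param


def fits(model: ModelSpec, bytes_per_param: float, device_mem_gib: float) -> bool:
    """True if the weights alone fit in device memory (KV cache ignored)."""
    return weight_footprint_gib(model, bytes_per_param) <= device_mem_gib


# Assumed sizes: an MoE with ~1B active / ~7B total parameters, and a
# hypothetical dense model with the same active-parameter count.
moe = ModelSpec("moe-1b-7b", total_params=7e9, active_params=1e9)
dense = ModelSpec("dense-1b", total_params=1e9, active_params=1e9)

BYTES_FP16 = 2.0       # assumed 16-bit weights
EDGE_MEM_GIB = 8.0     # assumed edge-device memory budget
```

Under these assumptions, `weight_bytes_read_per_token` is the same for both models, while `weight_footprint_gib` is seven times larger for the MoE, so `fits(moe, BYTES_FP16, EDGE_MEM_GIB)` is false where `fits(dense, BYTES_FP16, EDGE_MEM_GIB)` is true. This is one way to see why sparse activation alone does not guarantee an advantage on memory-constrained devices.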