A user benchmarks the 432 GB Kimi K2.7 Code model split between a Mac Studio M3 Ultra and an NVIDIA RTX PRO 6000 over llama.cpp RPC, and finds that prefill speed improves while decode performance stays largely unchanged.
- Prefill speed increased by approximately 14.8% when offloading 20% of the model to the GPU.
- Decode speed gained only 4.2%; combined with the prefill gain, total request time improved by about 12.3%.
- With a 128K context, 20% was the largest share of the model that could be placed on the RTX card; higher splits failed.
- RPC traffic was measured at roughly 112-113 MiB/s over a direct Ethernet connection, with network costs being more noticeable during prefill than decode.
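
The three percentages above can be tied together with a little arithmetic. The following is a rough sketch, not the author's calculation: it assumes the 14.8% and 4.2% figures are throughput gains, that the 12.3% figure is a reduction in wall-clock request time, and that a request consists only of prefill and decode. The function names are hypothetical. Under those assumptions it solves for the share of baseline request time spent in prefill.

```python
def time_ratio(speedup_pct: float) -> float:
    """Fraction of the original time remaining after a throughput gain of speedup_pct percent."""
    return 1.0 / (1.0 + speedup_pct / 100.0)


def prefill_time_share(prefill_gain_pct: float,
                       decode_gain_pct: float,
                       total_time_cut_pct: float) -> float:
    """Solve f*p + (1 - f)*d = t for f, the baseline share of time spent in prefill.

    p and d are the per-phase time ratios after the speedup; t is the
    overall time ratio. ASSUMPTION: total_time_cut_pct is a reduction
    in wall-clock time, not a throughput gain.
    """
    p = time_ratio(prefill_gain_pct)
    d = time_ratio(decode_gain_pct)
    t = 1.0 - total_time_cut_pct / 100.0
    return (d - t) / (d - p)


share = prefill_time_share(14.8, 4.2, 12.3)
print(f"implied prefill share of baseline request time: {share:.0%}")
```

Under this reading, prefill accounts for the large majority of the baseline request time, which is why the overall improvement sits much closer to the prefill gain than to the decode gain. If the 12.3% is instead read as a throughput gain, the implied share drops to about 78%, and the same conclusion holds.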
The author concludes that this configuration helps fit larger models across devices, but the network interconnect limits the performance gains, so its main value is added capacity rather than a significant speedup.
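
To put the measured RPC traffic in context, the sketch below converts 112-113 MiB/s into megabits per second. The summary does not state the link speed, so the 1 Gbit/s figure used for comparison is purely an assumption for illustration, and the helper name is hypothetical.

```python
MIB = 1024 * 1024  # bytes per mebibyte

# ASSUMPTION: a 1 Gbit/s link, used only as a point of comparison.
ASSUMED_LINK_MBIT_S = 1000.0


def mib_per_s_to_mbit_per_s(rate_mib_s: float) -> float:
    """Convert a rate in MiB/s to decimal megabits per second."""
    return rate_mib_s * MIB * 8 / 1_000_000


for rate in (112, 113):
    mbit = mib_per_s_to_mbit_per_s(rate)
    fraction = mbit / ASSUMED_LINK_MBIT_S
    print(f"{rate} MiB/s = {mbit:.1f} Mbit/s ({fraction:.0%} of the assumed link)")
```

If the direct connection were in fact a gigabit link, the measured traffic would be close to its nominal rate, which would be consistent with the author's point that the interconnect caps the gains. A faster link would change that picture, so the comparison should be treated as a hypothesis rather than a finding.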