Tensor split performance on low-bandwidth (TB3) eGPUs, and a question

A user reports testing tensor split mode with two Morefine G1 4090M 16GB eGPUs, each connected via Thunderbolt 3 at 40Gbps. Layer split mode gives fast prefill (PP) but only moderate text generation (TG) speed. Tensor split mode roughly doubles TG speed and keeps both cards fully loaded, but its PP throughput falls to less than half because the Thunderbolt links saturate.

Layer split mode achieves approximately 1300t/s PP and 26t/s TG (35-40t/s with MTP) for Qwen3.6-27B @ Q4.
Tensor split mode with MTP (draft-n-max 3) reaches 50-60t/s during TG, saturating both cards at 140W each and using about 800MB/s of total bandwidth.
PP performance in tensor split mode drops to 500-600t/s with an empty context because the low-bandwidth links are saturated.
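
To see why prefill is the phase that hits the link limit, here is a rough back-of-envelope sketch in Python using only the figures reported above. It assumes (this is not stated in the post) that inter-GPU traffic grows linearly with the number of tokens, so the per-token traffic seen during TG can be extrapolated to prefill. The function names are made up for illustration.

```python
# Back-of-envelope estimate built from the numbers reported in the post.
# ASSUMPTION (not from the post): inter-GPU traffic scales linearly with
# token count, so TG traffic per token can be extrapolated to prefill.

TB3_LINK_GBPS = 40                  # nominal Thunderbolt 3 link rate
TG_BANDWIDTH_MB_S = 800             # reported total traffic during tensor-split TG
TG_TOKENS_PER_S = (50, 60)          # reported tensor-split TG range with MTP
LAYER_SPLIT_PP_TOKENS_PER_S = 1300  # reported layer-split prefill rate


def mb_per_token(bandwidth_mb_s: float, tokens_per_s: float) -> float:
    """Traffic implied per generated token, in MB."""
    return bandwidth_mb_s / tokens_per_s


def required_mb_s(per_token_mb: float, tokens_per_s: float) -> float:
    """Bandwidth needed to sustain a token rate at a given per-token traffic."""
    return per_token_mb * tokens_per_s


# Nominal link rate in MB/s; usable PCIe throughput over TB3 is lower.
nominal_link_mb_s = TB3_LINK_GBPS * 1000 / 8

for tg_rate in TG_TOKENS_PER_S:
    per_token = mb_per_token(TG_BANDWIDTH_MB_S, tg_rate)
    needed = required_mb_s(per_token, LAYER_SPLIT_PP_TOKENS_PER_S)
    print(
        f"TG {tg_rate} t/s: ~{per_token:.1f} MB/token, "
        f"~{needed:.0f} MB/s to prefill at {LAYER_SPLIT_PP_TOKENS_PER_S} t/s "
        f"({needed / nominal_link_mb_s:.1f}x the nominal link rate)"
    )
```

Treat the result as a crude upper bound. With MTP, each decode step also moves data for draft tokens that may be rejected, so traffic per generated token likely overstates traffic per processed token. Even so, the estimate lands several times above what a 40Gbps link can nominally carry, which fits the reported drop in PP.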

The author asks whether it is theoretically possible to implement a hybrid split that runs prefill on one card at a time (as in layer split) while decoding across both cards (as in tensor split), combining the high TG performance of tensor split with the lower bandwidth requirements of layer split for PP.
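
As a rough illustration of what such a hybrid could gain, the Python sketch below compares estimated request times using the midpoints of the rates reported above. The "hybrid" entry is purely hypothetical: it assumes layer-split PP speed and tensor-split TG speed with zero cost for switching between the two layouts, which a real implementation would have to pay for somehow. The prompt and output sizes are arbitrary example values, and the reported PP rates were measured with an empty context.

```python
# Toy comparison of request latency under each split mode.
# Rates are midpoints of the ranges reported in the post (tokens per second).
# ASSUMPTION: the hybrid mode is hypothetical and ignores any cost of
# switching weight or KV-cache layout between prefill and decode.

RATES = {
    "layer":  {"pp": 1300.0, "tg": 37.5},  # TG with MTP: 35-40 t/s
    "tensor": {"pp": 550.0,  "tg": 55.0},  # PP 500-600, TG with MTP 50-60
    "hybrid": {"pp": 1300.0, "tg": 55.0},  # hypothetical best of both
}


def request_seconds(mode: str, prompt_tokens: int, output_tokens: int) -> float:
    """Estimated time to prefill the prompt and then generate the output."""
    rates = RATES[mode]
    return prompt_tokens / rates["pp"] + output_tokens / rates["tg"]


# Arbitrary example workload: a long prompt with a short reply.
PROMPT_TOKENS = 8000
OUTPUT_TOKENS = 500

for mode in RATES:
    seconds = request_seconds(mode, PROMPT_TOKENS, OUTPUT_TOKENS)
    print(f"{mode:>6}: {seconds:.1f} s")
```

With these example sizes, tensor split is slower overall than layer split despite its faster TG, because prefill dominates. The hypothetical hybrid would beat both, and the margin depends heavily on the prompt-to-output ratio.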