Finally seeing benefits of MTP after removing GGML_CUDA_ALLREDUCE

A user reported that removing the GGML_CUDA_ALLREDUCE environment variable noticeably improved throughput (tokens per second, TPS) with MTP (multi-token prediction) in local LLM inference. The variable had previously been considered beneficial, so the result was unexpected: after extensive configuration trials, unsetting it reduced overhead and was the change that finally made MTP pay off.
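To check whether the same change helps on another setup, one option is to run the same inference command twice, once with the variable set and once without, and compare the reported TPS. The sketch below is a minimal Python helper for the "without" run. The helper names `env_without` and `run_without` are hypothetical, and the inference command is left to the reader because the post does not specify it.

```python
import os
import subprocess


def env_without(name, base=None):
    """Return a copy of the environment with `name` removed, if present.

    `base` defaults to the current process environment; the original
    mapping is never modified.
    """
    env = dict(os.environ if base is None else base)
    env.pop(name, None)
    return env


def run_without(cmd, name="GGML_CUDA_ALLREDUCE"):
    """Run `cmd` (a list of arguments) with `name` unset in the child only.

    `cmd` is whatever inference command you normally use; nothing about
    its flags is assumed here.
    """
    return subprocess.run(cmd, env=env_without(name), check=False)
```

Because the variable is removed only in the child process, the surrounding shell keeps its original settings, which makes back-to-back comparisons with and without it straightforward.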