The ggml project has optimized its AMX code path by flattening the work partition over n_batch * M, so that all threads participate in quantization. Because AMX is a CPU matrix extension, the gains apply to CPU inference: the change is reported to improve speed by up to 1.47x across various models and hardware configurations, with consistent reductions in inference time.
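To make the idea of flattening a partition concrete, here is a minimal sketch in Python. It is not ggml's actual code: the function names, the ceiling-division chunking, and the "per-batch" baseline are all assumptions chosen for illustration. It only shows why splitting one combined range of n_batch * M rows can keep more threads busy than splitting each batch's M rows separately when M is small.

```python
def split_range(total, n_threads, tid):
    """Return the half-open [start, end) slice of `total` items owned by thread `tid`."""
    chunk = (total + n_threads - 1) // n_threads  # ceiling division
    start = min(tid * chunk, total)
    end = min(start + chunk, total)
    return start, end


def per_batch_partition(n_batch, M, n_threads):
    """Hypothetical baseline: split the M rows of each batch among threads separately."""
    work = [[] for _ in range(n_threads)]
    for batch in range(n_batch):
        for tid in range(n_threads):
            start, end = split_range(M, n_threads, tid)
            work[tid].extend((batch, row) for row in range(start, end))
    return work


def flattened_partition(n_batch, M, n_threads):
    """Flattened variant: split one range of n_batch * M rows among threads."""
    work = [[] for _ in range(n_threads)]
    for tid in range(n_threads):
        start, end = split_range(n_batch * M, n_threads, tid)
        for flat in range(start, end):
            # Recover the (batch, row) pair from the flat index.
            work[tid].append(divmod(flat, M))
    return work


if __name__ == "__main__":
    # Illustrative sizes only: 8 batches of 2 rows each, 8 threads.
    print([len(w) for w in per_batch_partition(8, 2, 8)])
    print([len(w) for w in flattened_partition(8, 2, 8)])
```

With these illustrative sizes, the per-batch split gives all 16 rows to the first two threads and leaves the other six idle, while the flattened split gives each of the eight threads two rows. The real speedup depends on how ggml schedules and quantizes the rows, which this sketch does not model.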
ggml optimizes AMX with partition flattening