EvoTensile uses evolutionary algorithms to tune GEMM kernels for AMD GPUs, improving NT layout performance from 20 to 40 TFLOPS on Strix Halo. This speedup represents a significant advance over unoptimized kernels, though it remains below the theoretical roofline of 59.4 TFLOPS.