EvoTensile: Evolutionary tuning of AMD Tensile GEMM kernels
EvoTensile uses evolutionary algorithms to tune GEMM kernels for AMD GPUs, improving NT layout performance from 20 to 40 TFLOPS on Strix Halo. This 2x speedup is a significant advance over unoptimized kernels, though 40 TFLOPS is still only about 67% of the theoretical roofline of 59.4 TFLOPS.