Frontier LLMs Struggle to Write Fast Multi-GPU Kernels
ParallelKernelBench evaluates LLMs on writing fast multi-GPU CUDA kernels for 87 real workloads. The top model's kernels run at less than a third of the speed of optimal implementations, though a few outputs surpass any existing public code.