A new framework models the energy consumed by Transformer training on multiple GPUs. It uses architectural sweeps of BERT to relate measured energy to compute, memory traffic, and hardware-efficiency proxies. Inspired by roofline analysis, the model includes a speedup-based hardware-efficiency factor and predicts training energy across diverse GPU configurations.
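
To make the idea of a roofline-inspired energy model with a speedup-based efficiency factor concrete, here is a minimal sketch in Python. The functional form is an assumption made for illustration and is not taken from the source: energy is treated as a linear combination of compute and memory traffic, divided by a parallel-efficiency proxy defined as measured speedup over ideal linear speedup. The names `Workload`, `hardware_efficiency`, `predict_energy_joules`, and the coefficients `joules_per_flop` and `joules_per_byte` are all hypothetical; in practice such coefficients would be fitted to measured energy.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of one training run."""
    flops: float        # total floating-point operations
    bytes_moved: float  # total memory traffic in bytes


def hardware_efficiency(speedup: float, n_gpus: int) -> float:
    """Assumed efficiency proxy: measured speedup over ideal linear speedup."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return speedup / n_gpus


def predict_energy_joules(
    workload: Workload,
    joules_per_flop: float,
    joules_per_byte: float,
    speedup: float,
    n_gpus: int,
) -> float:
    """Assumed model: ideal compute-plus-memory energy scaled by 1 / efficiency."""
    ideal = (joules_per_flop * workload.flops
             + joules_per_byte * workload.bytes_moved)
    return ideal / hardware_efficiency(speedup, n_gpus)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    run = Workload(flops=1e15, bytes_moved=1e12)
    energy = predict_energy_joules(
        run, joules_per_flop=1e-11, joules_per_byte=1e-9,
        speedup=3.2, n_gpus=4,
    )
    print(f"predicted energy: {energy:.1f} J")
```

Under this assumed form, a configuration that scales poorly (low speedup for its GPU count) is predicted to spend more energy on the same workload, which is the qualitative behavior a speedup-based efficiency factor is meant to capture.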
Energy Consumption Model for Transformer Training