This article introduces a framework for modeling the energy consumption of multi-GPU Transformer training, motivated by the growing computational cost of these models and the need for sustainable system design.
- The model relates measured energy to lightweight proxies for compute, memory traffic, and hardware efficiency using controlled architectural sweeps of BERT models.
- It incorporates a speedup-based hardware-efficiency factor inspired by roofline models to capture the effects of tensor parallelism and fully sharded data parallelism.
- The authors derive a scaling-law model that predicts training energy across heterogeneous configurations of architecture and parallelism strategy.
Such predictions support cost-aware and sustainable design as Transformer models grow in size and in degree of parallelism.
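
To make the idea of relating energy to compute, memory-traffic, and hardware-efficiency proxies concrete, here is a minimal sketch of fitting a power-law energy model by least squares in log space. The summary does not give the authors' equation, so everything here is an assumption for illustration: the functional form `E = a * C^alpha * M^beta * eta^(-gamma)`, the definition of `eta` as speedup divided by GPU count, the function names, and the synthetic data (which is made up and not taken from any measurement).

```python
import numpy as np

def parallel_efficiency(t_single, t_parallel, n_gpus):
    """Assumed efficiency proxy: speedup over one GPU divided by GPU count."""
    speedup = t_single / t_parallel
    return speedup / n_gpus  # 1.0 would mean ideal linear scaling

def fit_energy_law(flops, bytes_moved, efficiency, energy):
    """Fit log E = log a + alpha*log C + beta*log M - gamma*log eta."""
    X = np.column_stack([
        np.ones_like(flops),      # intercept, log a
        np.log(flops),            # compute proxy C
        np.log(bytes_moved),      # memory-traffic proxy M
        -np.log(efficiency),      # hardware-efficiency factor eta
    ])
    coef, *_ = np.linalg.lstsq(X, np.log(energy), rcond=None)
    log_a, alpha, beta, gamma = coef
    return np.exp(log_a), alpha, beta, gamma

def predict_energy(params, flops, bytes_moved, efficiency):
    """Evaluate the assumed power-law form for given proxies."""
    a, alpha, beta, gamma = params
    return a * flops**alpha * bytes_moved**beta * efficiency**(-gamma)

# Synthetic sweep (hypothetical values, noise-free) to check the fit.
rng = np.random.default_rng(0)
n = 40
flops = 10.0 ** rng.uniform(15, 18, n)        # total FLOPs per run
bytes_moved = 10.0 ** rng.uniform(12, 15, n)  # total bytes of memory traffic
n_gpus = rng.choice([1, 2, 4, 8], n)
t_single = rng.uniform(100.0, 1000.0, n)      # single-GPU runtime, seconds
t_parallel = t_single / n_gpus**0.8           # sublinear multi-GPU scaling
eta = parallel_efficiency(t_single, t_parallel, n_gpus)

true_params = (2e-10, 0.9, 0.1, 1.0)          # hypothetical ground truth
energy = predict_energy(true_params, flops, bytes_moved, eta)

fitted = fit_energy_law(flops, bytes_moved, eta, energy)
predicted = predict_energy(fitted, flops, bytes_moved, eta)
```

Fitting in log space turns the power law into an ordinary linear regression, which is why a single `np.linalg.lstsq` call suffices. With real measurements the fit would carry noise, so held-out configurations would be needed to judge predictive accuracy.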