The authors present a framework for modeling the energy consumption of Transformer training across multiple GPUs, addressing the need for sustainable system design as computational costs rise. Through controlled architectural sweeps on BERT models, they relate measured energy usage to lightweight proxies for compute, memory traffic, and hardware efficiency. The approach is inspired by roofline models and incorporates a speedup-based hardware-efficiency factor to account for tensor parallelism and fully sharded data parallelism. From these measurements they derive a scaling-law model that accurately predicts training energy across heterogeneous configurations. The work underscores the need to predict energy consumption as model size and parallelism scale, and it provides a practical tool for cost-aware design of large-scale natural language processing systems.
Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model