The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

This article introduces a framework for modeling the energy consumption of Transformer training on multiple GPUs, motivated by the growing computational cost of these models and the need for sustainable system design.

Using controlled architectural sweeps of BERT models, the framework relates measured energy to lightweight proxies for compute, memory traffic, and hardware efficiency.
It incorporates a speedup-based hardware-efficiency factor inspired by roofline models to capture the effects of tensor parallelism and fully sharded data parallelism.
The authors derive a scaling law that accurately predicts training energy across heterogeneous configurations, that is, across differing model architectures and parallelism strategies.
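
To make the idea concrete, the following is a minimal sketch of how such a model could be fitted. It is not the authors' formulation. The functional form (energy proportional to a weighted sum of a compute proxy and a memory-traffic proxy, divided by a speedup-based efficiency), the function names, and the numbers are all assumptions chosen for illustration, and the data is synthetic.

```python
# Illustrative sketch only: the functional form below is an assumption,
# not the scaling law from the article.
#
#   energy ~ (a * compute + b * mem_traffic) / efficiency
#
# where efficiency = speedup / n_gpus (1.0 means ideal linear scaling).


def parallel_efficiency(t_single, t_parallel, n_gpus):
    """Speedup-based efficiency of a multi-GPU run relative to one GPU."""
    speedup = t_single / t_parallel
    return speedup / n_gpus


def fit_energy_model(compute, mem_traffic, efficiency, energy):
    """Ordinary least squares for the two coefficients (a, b).

    Solves the 2x2 normal equations directly, so no external
    libraries are required.
    """
    x1 = [c / e for c, e in zip(compute, efficiency)]
    x2 = [m / e for m, e in zip(mem_traffic, efficiency)]
    s11 = sum(u * u for u in x1)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s22 = sum(v * v for v in x2)
    r1 = sum(u * y for u, y in zip(x1, energy))
    r2 = sum(v * y for v, y in zip(x2, energy))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("compute and memory proxies are collinear")
    a = (r1 * s22 - r2 * s12) / det
    b = (s11 * r2 - s12 * r1) / det
    return a, b


def predict_energy(coef, compute, mem_traffic, efficiency):
    """Predicted energy for one configuration under the assumed form."""
    a, b = coef
    return (a * compute + b * mem_traffic) / efficiency


# Synthetic, made-up configurations (arbitrary units), generated from
# known coefficients so the fit can be checked against them.
compute = [1.0, 2.0, 4.0, 8.0]        # compute proxy, e.g. PFLOPs
mem_traffic = [0.5, 0.8, 2.5, 3.0]    # memory-traffic proxy, e.g. TB
efficiency = [1.0, 0.9, 0.8, 0.7]     # speedup / n_gpus per configuration
true_a, true_b = 2.0, 5.0
energy = [
    (true_a * c + true_b * m) / e
    for c, m, e in zip(compute, mem_traffic, efficiency)
]

coef = fit_energy_model(compute, mem_traffic, efficiency, energy)
print("fitted coefficients:", coef)
print("predicted energy:", predict_energy(coef, 16.0, 6.0, 0.6))
```

Dividing by the efficiency term means that a configuration with poor parallel scaling is charged more energy for the same amount of work, which is the intuition behind a speedup-based hardware-efficiency factor. A real fit would use measured energy and would likely need a richer functional form than this two-coefficient sketch.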

Such predictions support cost-aware and sustainable design as Transformer models scale in size and parallelism.