A new Transformer architecture for language modeling uses separate global and local branches, coordinated dynamically through FiLM (feature-wise linear modulation). In experiments on small datasets such as TinyShakespeare and WikiText-2, it outperforms both single-branch models and weakened dual-branch variants. The results are stable across multiple random seeds, and the modulation shows channel-selective patterns.
FiLM-Coordinated Dual-Branch Transformer for Language Modeling
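To make the coordination mechanism concrete, here is a minimal sketch of FiLM in plain Python: a conditioning vector predicts a per-channel scale (gamma) and shift (beta) that are applied to a feature vector. The summary does not say which branch conditions which, so the choice below (the global branch modulating the local branch) is an assumption, as are all names (`film`, `linear`, `init_identity_film`), the sizes, and the identity initialization. It illustrates the general technique, not the paper's implementation.

```python
def linear(x, weight, bias):
    """Affine map; `weight` holds one row of input weights per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def film(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: out[c] = gamma[c] * features[c] + beta[c],
    where gamma and beta are predicted from `condition`."""
    gamma = linear(condition, w_gamma, b_gamma)
    beta = linear(condition, w_beta, b_beta)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


def init_identity_film(d_cond, d_feat):
    """Hypothetical initialization: zero weights, gamma bias 1, beta bias 0,
    so the modulation starts out as the identity."""
    w_gamma = [[0.0] * d_cond for _ in range(d_feat)]
    b_gamma = [1.0] * d_feat
    w_beta = [[0.0] * d_cond for _ in range(d_feat)]
    b_beta = [0.0] * d_feat
    return w_gamma, b_gamma, w_beta, b_beta


# Assumed setup: a 2-dim global summary modulates a 3-channel local state.
global_h = [1.0, 2.0]
local_h = [0.5, -1.0, 2.0]

w_gamma, b_gamma, w_beta, b_beta = init_identity_film(d_cond=2, d_feat=3)
unchanged = film(local_h, global_h, w_gamma, b_gamma, w_beta, b_beta)

# Let only channel 0 respond to the first global feature.
w_gamma[0] = [1.0, 0.0]
selective = film(local_h, global_h, w_gamma, b_gamma, w_beta, b_beta)
```

In this toy setup, `unchanged` equals `local_h`, while `selective` rescales only channel 0. That is one simple way a modulation can be channel-selective: most channels pass through untouched and a few are scaled or shifted by the other branch.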