This study investigates whether current language model scaling paradigms can close fidelity gaps in social simulations across opinion modeling, behavioral simulation, and longitudinal forecasting. Using 85 Qwen3 transformer models trained on the DCLM corpus under fixed-compute budgets from $10^{18}$ to $10^{20}$ FLOPs, the authors analyze the relationship between compute scale and simulation accuracy.
- Scaling laws fitted to 35 open-weight models of up to 70B parameters predict that most behavioral and opinion tasks will improve rapidly with scale, particularly for populations that are well represented in English web corpora.
- Longitudinal forecasting and the simulation of underrepresented opinions scale more slowly, especially when performance on them is less correlated with general knowledge benchmarks such as MMLU.
- Scaling fails to improve model calibration to human cognitive biases such as risk aversion, or to heuristics such as learning correlated rewards, even when models from 0.5B to 8B parameters are fine-tuned.
The authors conclude that while scale generally improves social simulations, the improvement is less reliable in low-resource domains and for specific human-like behaviors that do not correlate with general reasoning capabilities.
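
To make the notion of a task that "scales more slowly" concrete, the sketch below fits a simple power law, error ≈ a · C^(−b), to error measured at several compute budgets C, then extrapolates it. This is an illustration under stated assumptions, not the authors' procedure: the summary does not give the functional form they used, the helper names `fit_power_law` and `predict_error` are hypothetical, and the data points are synthetic rather than results from the study.

```python
import numpy as np


def fit_power_law(compute, error):
    """Fit error ~ a * compute**(-b) by least squares in log-log space.

    Hypothetical helper: the power-law form is an assumption for illustration.
    Returns (a, b), where a larger b means faster improvement with compute.
    """
    slope, intercept = np.polyfit(np.log10(compute), np.log10(error), deg=1)
    return 10.0 ** intercept, -slope


def predict_error(a, b, compute):
    """Evaluate the fitted power law at one or more compute budgets."""
    return a * np.asarray(compute, dtype=float) ** (-b)


# Synthetic, noise-free points spanning the budgets named in the text
# (1e18 to 1e20 FLOPs). These values are made up, not taken from the study.
compute = np.logspace(18, 20, num=5)
true_a, true_b = 50.0, 0.05
error = true_a * compute ** (-true_b)

a_hat, b_hat = fit_power_law(compute, error)

# Extrapolate one order of magnitude beyond the largest fitted budget.
extrapolated = predict_error(a_hat, b_hat, 1e21)
print(f"fitted exponent b = {b_hat:.4f}")
print(f"extrapolated error at 1e21 FLOPs = {extrapolated:.4f}")
```

Under this framing, a task whose fitted exponent `b` is close to zero gains little from additional compute, which is one way to read the slower scaling reported for longitudinal forecasting and underrepresented opinions.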