BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

The authors introduce BehaviorBench, a comprehensive benchmark designed to evaluate foundation models across diverse behavioral science tasks and populations. The study assesses four core capabilities—behavior prediction, strategic decision-making, subject-trait inference, and behavioral knowledge application—at both individual and distributional levels.

BehaviorBench scores model outputs at both the individual level and the distributional level, since matching population-level response distributions is essential for behavioral validity.
The benchmark tests four capabilities: behavior prediction/simulation, strategic decision-making, subject-trait inference, and behavioral knowledge application.
Be.FM-1.5 is developed as an extension of the Be.FM family, fine-tuned on behavioral data using tasks from BehaviorBench.
Proprietary general-purpose models excel at individual-level prediction, while behavioral foundation models achieve stronger distributional alignment.
Be.FM-1.5 leads on distributional metrics and remains competitive on individual-level metrics, suggesting that proper adaptation can close the individual-level gap with proprietary models.
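
The difference between individual-level and distributional evaluation can be illustrated with a minimal sketch. The summary does not say which metrics BehaviorBench uses, so the two below (exact-match accuracy and total variation distance) are assumptions chosen for illustration, and the function names and toy data are hypothetical.

```python
from collections import Counter


def individual_accuracy(predicted, observed):
    """Fraction of subjects whose predicted choice matches their observed choice."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def total_variation_distance(predicted, observed):
    """Half the L1 distance between the two empirical choice distributions.

    0.0 means identical population-level distributions; 1.0 means disjoint.
    """
    p_counts, o_counts = Counter(predicted), Counter(observed)
    support = set(p_counts) | set(o_counts)
    return 0.5 * sum(
        abs(p_counts[k] / len(predicted) - o_counts[k] / len(observed))
        for k in support
    )


# Toy data (hypothetical, not from the benchmark): ten subjects in a
# two-option game, six of whom cooperate.
observed = ["cooperate"] * 6 + ["defect"] * 4

# Model A always predicts the majority choice.
model_a = ["cooperate"] * 10

# Model B reproduces the 60/40 population split but assigns the
# choices to the wrong subjects.
model_b = ["defect"] * 4 + ["cooperate"] * 6

acc_a = individual_accuracy(model_a, observed)
tvd_a = total_variation_distance(model_a, observed)
acc_b = individual_accuracy(model_b, observed)
tvd_b = total_variation_distance(model_b, observed)

print(f"Model A: accuracy={acc_a:.2f}, TVD={tvd_a:.2f}")
print(f"Model B: accuracy={acc_b:.2f}, TVD={tvd_b:.2f}")
```

In this toy case Model A scores higher on individual accuracy but misrepresents the population, while Model B matches the population distribution exactly despite lower per-subject accuracy. A model can therefore rank differently depending on which level is evaluated.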

The results highlight the importance of distributional evaluation for developing behaviorally aligned AI systems and demonstrate the potential of Be.FM-1.5 for a broad range of behavioral science studies.