Researchers introduce AGC-Bench, a unified benchmark for artificial general creativity built from 3,101 screened papers and covering 78 datasets across domains such as brainstorming and STEM. To address bias in automated evaluation, the team fine-tunes Qwen3-30B on bias-corrected ratings to create AGC-Judge, an open-weight judge model intended to score new creativity benchmarks robustly.

  • The benchmark's datasets include narrative, humor, and figurative language tasks, all evaluated through an agentic harness standardized to HELM.
  • Factor analysis across 83 LLMs recovers a single creativity factor 'c' explaining 81.5% of variance, which is related to but separable from general intelligence.
  • Prompting models to "be creative" boosts performance significantly more than enabling reasoning does, which supports the claim that the benchmark tracks creativity rather than general ability.
  • On a human-matched subset, top humans still outperform top LLMs on creativity tasks.
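
To make the "single factor explaining most of the variance" claim concrete, the following is a minimal sketch of how such a share can be computed. It is not the authors' method: it substitutes a principal-component calculation on the correlation matrix for a full factor analysis, and it runs on synthetic scores generated here purely for illustration. Only the matrix shape (83 models by 78 datasets) comes from the summary above; the function name, loadings, and noise model are assumptions.

```python
import numpy as np


def first_factor_share(scores: np.ndarray) -> float:
    """Share of total variance captured by the first principal component
    of the dataset-by-dataset correlation matrix.

    scores has shape (n_models, n_datasets). This is a PCA-style stand-in
    for factor analysis, used here only as an illustration.
    """
    corr = np.corrcoef(scores, rowvar=False)  # columns (datasets) are the variables
    eigvals = np.linalg.eigvalsh(corr)        # eigenvalues in ascending order
    return float(eigvals[-1] / eigvals.sum())


# Synthetic data (assumption): one latent factor drives every dataset score.
rng = np.random.default_rng(0)
n_models, n_datasets = 83, 78                 # shape taken from the summary
latent = rng.normal(size=(n_models, 1))       # hypothetical per-model factor score
loadings = rng.uniform(0.7, 0.95, size=(1, n_datasets))  # hypothetical loadings
noise = rng.normal(size=(n_models, n_datasets)) * np.sqrt(1 - loadings**2)
scores = latent @ loadings + noise

share = first_factor_share(scores)
print(f"first-factor share of variance: {share:.3f}")
```

When the columns share one strong latent driver, the first eigenvalue dominates and the share is large; on uncorrelated scores it drops toward a small fraction, which is why a high share is read as evidence for a common factor.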

The release provides open infrastructure for measuring AI creativity at scale, along with evidence on how AI creativity compares with human creativity and relates to general intelligence.