CombEval is a dynamic benchmark that uses typed Cofola specifications to generate natural-language counting problems with verified answers. An evaluation of 11 large language models reveals persistent failures in handling ordered objects, indistinguishable elements, positional constraints, and nested dependencies, with errors rooted in how models interpret constraints and apply counting principles.
CombEval: Benchmark for Combinatorial Counting in LLMs
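To make "counting problems with verified answers" concrete, here is a minimal Python sketch of a problem that combines indistinguishable elements with a positional constraint. It is not Cofola syntax and not the benchmark's own pipeline. The problem (arrangements of the letters AABBC with no two A's adjacent) and the helper names `count_by_enumeration` and `count_by_formula` are hypothetical, chosen only for illustration.

```python
from itertools import permutations
from math import factorial


def count_by_enumeration(letters: str, forbidden: str) -> int:
    """Count distinct arrangements of `letters` that do not contain `forbidden`."""
    # A set collapses arrangements that differ only by swapping identical letters.
    distinct = {"".join(p) for p in permutations(letters)}
    return sum(1 for word in distinct if forbidden not in word)


def count_by_formula() -> int:
    """Closed-form count for AABBC with the two A's not adjacent (assumed example)."""
    # All distinct arrangements of the multiset {A, A, B, B, C}.
    total = factorial(5) // (factorial(2) * factorial(2))
    # Arrangements with the A's glued into one block: {AA, B, B, C}.
    adjacent = factorial(4) // factorial(2)
    return total - adjacent


brute = count_by_enumeration("AABBC", "AA")
closed = count_by_formula()
print(brute, closed)  # both are 18
```

The two counts agree only when the identical letters are treated as indistinguishable. Counting raw permutations without deduplication would overcount by a factor of 2! · 2! = 4.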