The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans

A study introduces the "riddle riddle" paradigm to determine whether large language models (LLMs) rely on flexible reasoning or pattern matching, revealing that humans and LLMs fail in opposite directions. In experiments involving nine state-of-the-art LLMs and 100 human participants, LLMs performed significantly worse on riddle riddles than on genuine riddles, while humans showed the reverse trend.

LLMs achieved 84.9% accuracy on genuine riddles but only 50.7% on riddle riddles, whereas humans scored 50.5% on genuine riddles and 80.5% on riddle riddles.
Error analysis indicates that 90.8% of LLM errors on riddle riddles resulted from inappropriate use of inventive reasoning, compared to only 57.6% of human errors on genuine riddles.
The findings suggest that strong LLM performance on genuine riddles likely reflects memory retrieval rather than flexible strategy selection based on content.
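
The opposite-direction pattern can be made concrete with a little arithmetic on the accuracies reported above. The sketch below is illustrative only: the dictionary layout, the `shift` helper, and the "crossover" difference-of-differences are assumptions introduced here, not quantities or methods reported by the study.

```python
# Accuracy figures (percent) as reported in the summary above.
# The data layout and all names here are hypothetical, chosen for illustration.
accuracy = {
    "LLMs": {"genuine": 84.9, "riddle_riddle": 50.7},
    "humans": {"genuine": 50.5, "riddle_riddle": 80.5},
}


def shift(group: str) -> float:
    """Percentage-point change when moving from genuine riddles to riddle riddles."""
    scores = accuracy[group]
    return round(scores["riddle_riddle"] - scores["genuine"], 1)


llm_shift = shift("LLMs")      # negative: LLMs do worse on riddle riddles
human_shift = shift("humans")  # positive: humans do better on riddle riddles

# Difference of differences: a simple descriptive contrast between the two groups.
crossover = round(human_shift - llm_shift, 1)

print(f"LLM shift:   {llm_shift:+.1f} points")
print(f"Human shift: {human_shift:+.1f} points")
print(f"Crossover:   {crossover:.1f} points")
```

Under these figures, LLM accuracy drops by 34.2 percentage points while human accuracy rises by 30.0, so the two groups move in opposite directions by roughly the same amount.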

The authors argue that without stimuli designed to elicit this contrast, it is easy to conflate LLM-generated outputs that resemble reasoning with actual flexible reasoning capabilities.