Epi2Diff: Using LLM Reasoning Traces to Predict Human Item Difficulty

Researchers introduce Epi2Diff, a framework that maps Large Reasoning Model (LRM) traces into cognitively grounded episode sequences to predict human item difficulty in educational assessment. By modeling difficulty through reasoning scale, effort allocation, and state transitions, the method provides an interpretable alternative to costly human calibration.

Epi2Diff groups trace segments into functional problem-solving states, extracts compact episode-dynamic features from the resulting sequence, and combines them with semantic item representations.
In experiments on four real-world datasets, Epi2Diff consistently outperforms baselines, including fine-tuned small language models and LLM in-context learning.
On SAT-derived classification benchmarks, Epi2Diff achieves an 8.1% average relative gain over supervised LLM fine-tuning baselines.
Analysis reveals that harder items induce more effortful, iterative, and implementation-centered episode dynamics rather than merely longer responses.
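
To make the three feature families named above (reasoning scale, effort allocation, state transitions) concrete, here is a minimal Python sketch. It is not the authors' implementation: the state labels in `STATES`, the `(state, token_count)` input format, and the function name `episode_features` are all assumptions made for illustration, since the summary does not specify the episode taxonomy or the exact features.

```python
from collections import Counter
from itertools import groupby

# Hypothetical state labels; the actual Epi2Diff taxonomy is not given here.
STATES = ("read", "plan", "implement", "verify")


def episode_features(segments):
    """Compute simple episode-dynamic features from a labeled trace.

    segments: list of (state, token_count) tuples in trace order
    (an assumed input format).
    """
    # Merge consecutive segments that share a state into one episode.
    episodes = [
        (state, sum(tokens for _, tokens in group))
        for state, group in groupby(segments, key=lambda seg: seg[0])
    ]
    total_tokens = sum(tokens for _, tokens in episodes)

    # Reasoning scale: how many episodes and how many tokens overall.
    features = {"n_episodes": len(episodes), "total_tokens": total_tokens}

    # Effort allocation: share of tokens spent in each state.
    tokens_by_state = Counter()
    for state, tokens in episodes:
        tokens_by_state[state] += tokens
    for state in STATES:
        share = tokens_by_state[state] / total_tokens if total_tokens else 0.0
        features[f"share_{state}"] = share

    # State transitions: counts of moves between distinct states.
    order = [state for state, _ in episodes]
    transitions = Counter(zip(order, order[1:]))
    for src in STATES:
        for dst in STATES:
            if src != dst:
                features[f"trans_{src}_{dst}"] = transitions[(src, dst)]
    return features


# Toy trace with a verify -> implement loop (made-up numbers).
example = [
    ("read", 10), ("plan", 20), ("implement", 30),
    ("implement", 20), ("verify", 10), ("implement", 10),
]
feats = episode_features(example)
```

In this toy trace, the return from `verify` to `implement` shows up as a transition count rather than as extra length, which is the kind of iterative, implementation-centered signal the analysis describes. A downstream difficulty predictor would then receive these features alongside an item embedding; that step is omitted here.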

This approach demonstrates that cognitive episodes in reasoning traces offer a predictive and interpretable process representation for educational measurement.