A study finds that large language models encode essay quality in a linearly accessible form within their hidden representations. These representations emerge layer by layer, remain stable across prompts, and transfer partially across different essay prompts, with longer essays relying more on deeper model layers. The research also identifies specific "essay scoring neurons" whose activations correlate strongly with scores and can be influenced by targeted interventions.
Essay Quality Representations in LLMs Found to Be Linearly Accessible
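
To make "linearly accessible" concrete, the sketch below fits a linear probe, here a closed-form ridge regression, that predicts a score from a hidden-state vector and reports held-out R^2 for each layer. It is an illustration under stated assumptions, not the study's pipeline: the hidden states are synthetic, the three "layers" differ only in how strongly a score direction is injected, and the names `fit_linear_probe`, `r_squared`, and `make_synthetic_layer` are hypothetical. The summary does not say which probe the study used.

```python
import numpy as np


def fit_linear_probe(hidden, scores, l2=1.0):
    """Fit a ridge-regression probe (closed form) from hidden states to scores."""
    mean_h = hidden.mean(axis=0)
    mean_y = scores.mean()
    X = hidden - mean_h          # center features so the bias is not penalized
    y = scores - mean_y
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ y)
    b = mean_y - mean_h @ w
    return w, b


def r_squared(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def make_synthetic_layer(n, d, signal, rng):
    """ASSUMPTION: synthetic stand-in for one layer's hidden states.

    Gaussian noise plus a single 'score direction' scaled by `signal`.
    """
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    scores = rng.uniform(1.0, 6.0, size=n)  # hypothetical 1-6 score scale
    centered = scores - scores.mean()
    hidden = rng.normal(size=(n, d)) + signal * np.outer(centered, direction)
    return hidden, scores


rng = np.random.default_rng(0)
results = {}
# Hypothetical signal strengths chosen to mimic information emerging with depth.
for layer, signal in enumerate([0.0, 0.5, 2.0]):
    hidden, scores = make_synthetic_layer(600, 32, signal, rng)
    train, test = slice(0, 400), slice(400, 600)
    w, b = fit_linear_probe(hidden[train], scores[train])
    results[layer] = r_squared(scores[test], hidden[test] @ w + b)
    print(f"layer {layer}: held-out R^2 = {results[layer]:.3f}")
```

With real data, `hidden` would instead hold one pooled activation vector per essay taken from a given model layer. A probe that scores well on held-out essays from the same prompt but worse on essays from a different prompt would correspond to the partial transfer described above.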