Researchers introduce reinforcement learning with metacognitive feedback (RLMF) to address systemic deficiencies in large language models (LLMs), such as hallucinating with high confidence and misrepresenting their internal uncertainty. The method refines completion rankings during preference optimization based on the quality of a model's self-judgments of its own performance.
- RLMF operationalizes metacognition by using self-judgments to refine completion rankings during preference optimization.
- A novel metacognitive data selection mechanism identifies high-value training examples, outperforming naive active learning.
- The approach targets faithful calibration (FC), aligning expressed confidence with intrinsic uncertainty through a decoupled two-stage process.
- RLMF surpasses standard reinforcement learning by up to 63% while preserving accuracy on diverse tasks.
This paradigm enhances LLM metacognition and alignment. The results suggest that metacognitive performance can serve as an effective reinforcement learning signal, one that overcomes the limits of prior intrinsic feedback methods.
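To make the idea of refining completion rankings with self-judgments concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Completion` type, the Brier-style `metacognitive_score`, and the `meta_weight` parameter are all hypothetical, and the summary does not specify how RLMF actually scores self-judgments or builds preference pairs.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Completion:
    text: str
    correct: bool      # outcome from an external checker (assumed to be available)
    confidence: float  # model's self-judged probability of being correct, in [0, 1]


def metacognitive_score(c: Completion) -> float:
    """Hypothetical self-judgment quality: 1 minus the squared (Brier-style) error."""
    target = 1.0 if c.correct else 0.0
    return 1.0 - (c.confidence - target) ** 2


def combined_score(c: Completion, meta_weight: float = 0.5) -> float:
    """Task reward plus a weighted bonus for accurate self-judgment.

    With meta_weight < 1 the bonus can never outweigh correctness, so a
    correct completion always ranks above an incorrect one.
    """
    return float(c.correct) + meta_weight * metacognitive_score(c)


def rank_completions(completions, meta_weight: float = 0.5):
    """Return completions sorted best-first by the combined score."""
    return sorted(
        completions,
        key=lambda c: combined_score(c, meta_weight),
        reverse=True,
    )


def preference_pairs(completions, meta_weight: float = 0.5):
    """Build (chosen, rejected) pairs from the refined ranking, skipping ties."""
    ranked = rank_completions(completions, meta_weight)
    return [
        (a, b)
        for a, b in combinations(ranked, 2)
        if combined_score(a, meta_weight) > combined_score(b, meta_weight)
    ]


# Illustrative samples (made-up values, not data from the paper).
samples = [
    Completion("D", correct=False, confidence=0.95),  # confidently wrong
    Completion("B", correct=True, confidence=0.30),   # right but underconfident
    Completion("A", correct=True, confidence=0.90),   # right and confident
    Completion("C", correct=False, confidence=0.20),  # wrong but aware of it
]

for c in rank_completions(samples):
    print(c.text, round(combined_score(c), 3))
```

In this sketch, two completions with the same task outcome are separated by how well the stated confidence matched that outcome, so a confidently wrong answer ranks below a wrong answer that flagged its own uncertainty. The resulting pairs could then feed a preference-optimization objective. Whether RLMF combines the two signals additively is not stated in the summary.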