TAPO advances self-distillation by constructing explicit micro-reflective trajectories that retain erroneous reasoning and insert natural-language diagnoses. These trajectories, derived from correct and incorrect model rollouts, provide fine-grained error corrections anchored in the model's own reasoning, improving both first-pass reasoning and error correction compared to GRPO.
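As a rough illustration of what such a trajectory could look like, the sketch below splices an incorrect rollout, a diagnosis, and a correct rollout into one sequence. It is not the paper's implementation. The `Rollout` type, the `[diagnosis]` marker, the first-divergence rule, and the assumption that the diagnosis text is supplied by the caller are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    """Hypothetical container for one sampled reasoning trace."""
    steps: List[str]   # reasoning steps, in order
    correct: bool      # whether the final answer was verified as correct


def first_divergence(bad: List[str], good: List[str]) -> int:
    """Return the index of the first step where the two traces differ.

    If one trace is a prefix of the other, return the shorter length.
    """
    for i, (b, g) in enumerate(zip(bad, good)):
        if b != g:
            return i
    return min(len(bad), len(good))


def build_micro_reflective_trajectory(
    bad: Rollout, good: Rollout, diagnosis: str
) -> List[str]:
    """Keep the erroneous step, insert a diagnosis, then continue correctly.

    Assumption: the error is located at the first step where the incorrect
    rollout departs from the correct one. The source does not say how errors
    are located or how diagnoses are generated.
    """
    if bad.correct or not good.correct:
        raise ValueError("expected one incorrect and one correct rollout")
    k = first_divergence(bad.steps, good.steps)
    return (
        bad.steps[: k + 1]               # shared prefix plus the first divergent step, if any
        + [f"[diagnosis] {diagnosis}"]   # natural-language note on the error
        + good.steps[k:]                 # corrected continuation
    )
```

The resulting list could then be joined into a single training target, so the model sees the mistake, the explanation, and the fix in one sequence rather than only the clean solution.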
arXiv cs.LG · 7d ago · src: 8d ago · research
TAPO: Self-Distillation with Micro-Reflective Trajectories
Importance 3/3
New feature vs. leaders
New harness with differentiators
OpenAI
Google DeepMind
Meta AI
Evaluation & benchmarks
Reasoning models
Training methods