A theoretical analysis shows that off-policy degree, determined by gradient steps per rollout, significantly impacts importance sampling ratios and token update dominance. The study introduces Adaptive Clip Policy Optimization (ACPO), which adjusts clipping boundaries by token group variance, outperforming DAPO and CISPO on 3B and 7B models across mathematical, QA, and logic reasoning tasks.
Key Factors in RL for LLM Reasoning Revealed
from English