A new approach called negative token filtering enables stable single-rollout training by preventing false penalties on negative samples. Compared with group-based RL techniques, the method improves performance on agentic tasks and matches them on reasoning tasks.
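
To make the idea concrete, here is a minimal sketch of how filtering tokens out of a negative sample's loss might look in a single-rollout policy-gradient setting. The summary above does not say how tokens are selected for filtering, so the function name `filtered_pg_loss` and the per-token `filter_mask` argument are hypothetical placeholders, and the REINFORCE-style surrogate is an assumed stand-in rather than the method's actual objective.

```python
from typing import List


def filtered_pg_loss(
    logprobs: List[float],
    advantage: float,
    filter_mask: List[bool],
) -> float:
    """Token-level policy-gradient loss for one rollout (illustrative only).

    logprobs: log-probability of each sampled token under the policy.
    advantage: scalar advantage of the rollout (reward minus a baseline).
    filter_mask: hypothetical per-token flags; True marks a token that
        should not be penalized when the rollout is negative. How these
        flags are computed is an assumption left to the caller.
    """
    if len(logprobs) != len(filter_mask):
        raise ValueError("logprobs and filter_mask must have equal length")

    if advantage < 0:
        # Negative sample: drop flagged tokens so they receive no penalty.
        kept = [lp for lp, skip in zip(logprobs, filter_mask) if not skip]
    else:
        # Positive or neutral sample: every token contributes.
        kept = list(logprobs)

    if not kept:
        # Everything was filtered, so this rollout contributes nothing.
        return 0.0

    # REINFORCE-style surrogate: minimize -(advantage * mean log-prob).
    return -advantage * sum(kept) / len(kept)


# One negative rollout in which the middle token is flagged as filtered.
loss = filtered_pg_loss([-1.0, -2.0, -3.0], advantage=-1.0,
                        filter_mask=[False, True, False])
```

In this sketch the mask only takes effect when the advantage is negative, so positive samples are trained on in full; whether the actual method treats positive samples this way is not stated in the summary.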