Researchers introduce QuasiMoTTo, a method that improves sample efficiency in language model inference and reinforcement learning by drawing correlated samples instead of independent ones. The approach reparameterizes autoregressive sampling as inverse-CDF sampling and draws the underlying uniform variates with quasi-Monte Carlo (QMC), which spreads the resulting samples more evenly across the output space.
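To make the reparameterization concrete, here is a minimal sketch of inverse-CDF sampling driven by evenly spread uniforms. It is an illustration under stated assumptions, not the authors' implementation: `toy_next_token_probs` is a hypothetical stand-in for a language model, one uniform per sequence is decoded by nested interval rescaling, and a randomly shifted one-dimensional lattice stands in for whatever QMC point set the method actually uses.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a language model over a 3-token vocabulary."""
    # Assumption: the distribution depends only on the parity of the prefix length.
    return [0.5, 0.3, 0.2] if len(prefix) % 2 == 0 else [0.1, 0.6, 0.3]


def decode_from_uniform(u, length):
    """Map one uniform u in [0, 1) to a token sequence by nested inverse-CDF steps."""
    prefix = []
    for _ in range(length):
        probs = toy_next_token_probs(prefix)
        cdf = list(accumulate(probs))
        # Inverse CDF: pick the token whose interval [lo, lo + p) contains u.
        token = min(bisect_right(cdf, u), len(probs) - 1)
        lo = cdf[token] - probs[token]
        # Rescale u within the chosen interval so it is uniform again for the next step.
        # All toy probabilities are nonzero, so the division is safe here.
        u = (u - lo) / probs[token]
        u = min(max(u, 0.0), 1.0 - 1e-12)  # guard against floating-point drift
        prefix.append(token)
    return tuple(prefix)


def shifted_lattice(k, rng):
    """k evenly spaced points in [0, 1) with one shared random shift.

    Each point is marginally uniform, but the points are correlated:
    exactly one falls in each interval [i/k, (i+1)/k).
    """
    shift = rng.random()
    return [((i + shift) / k) % 1.0 for i in range(k)]


rng = random.Random(0)
correlated = [decode_from_uniform(u, 4) for u in shifted_lattice(10, rng)]
independent = [decode_from_uniform(rng.random(), 4) for _ in range(10)]
print("distinct sequences, correlated :", len(set(correlated)))
print("distinct sequences, independent:", len(set(independent)))
```

Because every lattice point is marginally uniform, each decoded sequence still follows the toy model's distribution; only the joint behavior across the batch changes. Decoding a whole sequence from a single floating-point uniform loses precision quickly, so this one-dimensional construction is only suitable for short toy sequences.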
- QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples across four reasoning benchmarks.
- The method often saturates an upper bound on pass@k that holds for any marginal-preserving sampler.
- In policy-gradient RL (GRPO), QuasiMoTTo matches i.i.d. performance with 50% fewer training steps.
The RL speedup results from higher coverage, which yields a stronger learning signal per batch.
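The coverage claims can be illustrated with a small calculation. The sketch below compares pass@k under independent sampling, 1 - (1 - p)^k, with the union bound min(1, k·p), which holds for any sampler whose individual samples each succeed with probability p. The union bound is used here as one example of a marginal-preserving upper bound; the bound derived in the work itself may be stated differently or be tighter. The function names and the value of `p` are illustrative assumptions.

```python
def iid_pass_at_k(p, k):
    """pass@k when k samples are drawn independently, each correct with probability p."""
    return 1.0 - (1.0 - p) ** k


def union_bound_pass_at_k(p, k):
    """Upper bound on pass@k for any sampler whose samples are each correct with probability p."""
    return min(1.0, k * p)


def iid_samples_needed(p, target):
    """Smallest k for which independent sampling reaches the target pass@k."""
    if not (0.0 < p <= 1.0) or not (0.0 <= target < 1.0):
        raise ValueError("need 0 < p <= 1 and 0 <= target < 1")
    k = 1
    while iid_pass_at_k(p, k) < target:
        k += 1
    return k


p = 0.1  # illustrative per-sample success probability, not a reported number
for k in (1, 2, 5, 10):
    print(k, round(iid_pass_at_k(p, k), 4), union_bound_pass_at_k(p, k))
```

The gap between the two curves is the room a correlated sampler has to improve on independent draws at a fixed budget of k samples.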