Researchers introduce QuasiMoTTo, a method that improves sample efficiency in language model inference and reinforcement learning by drawing correlated samples instead of independent ones. The approach reparameterizes autoregressive sampling as inverse-CDF sampling and draws the underlying uniforms with quasi-Monte Carlo (QMC), so that the resulting samples spread more evenly across the output space.
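
The reparameterization described above can be sketched for a single categorical step. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the randomly shifted grid is a simple one-dimensional stand-in for whatever QMC construction the method actually uses. A full autoregressive sampler would need one uniform per generated token, which this sketch does not cover.

```python
import bisect
import itertools
import random


def inverse_cdf_sample(probs, u):
    """Map a uniform u in [0, 1) to an outcome index via the inverse CDF."""
    cdf = list(itertools.accumulate(probs))
    # First index whose cumulative mass exceeds u; clamp to guard against
    # the final CDF entry rounding slightly below 1.0.
    return min(bisect.bisect_right(cdf, u), len(probs) - 1)


def iid_uniforms(k, rng):
    """Baseline: k independent Uniform[0, 1) draws."""
    return [rng.random() for _ in range(k)]


def shifted_grid_uniforms(k, rng):
    """Assumed QMC stand-in: an evenly spaced grid with one shared random shift.

    Each point is marginally Uniform[0, 1), so every sample keeps the
    original distribution, but the k points are jointly spread out.
    """
    shift = rng.random()
    return [(i / k + shift) % 1.0 for i in range(k)]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
    for name, draw in [("i.i.d.", iid_uniforms), ("shifted grid", shifted_grid_uniforms)]:
        picks = [inverse_cdf_sample(probs, u) for u in draw(4, rng)]
        print(name, picks)
```

With four samples, the shifted grid always hits every outcome whose probability is at least 1/4, because any interval of that length must contain a grid point. Independent draws carry no such guarantee.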

  • QuasiMoTTo matches i.i.d. pass@k accuracy with 25-47% fewer samples across four reasoning benchmarks.
  • The method often saturates an upper bound on pass@k that holds for any marginal-preserving sampler.
  • In policy-gradient RL (GRPO), QuasiMoTTo matches i.i.d. performance with 50% fewer training steps.
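
To make the pass@k comparison concrete, the sketch below contrasts the i.i.d. value with one upper bound that holds for any marginal-preserving sampler. It assumes a single problem on which each sample is correct with probability p. The bound shown is the standard union bound, min(1, k·p); whether this is the exact bound the authors use is an assumption, and the function names are hypothetical.

```python
def pass_at_k_iid(p, k):
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def pass_at_k_union_bound(p, k):
    """Union bound on pass@k for any sampler whose samples each succeed with probability p."""
    return min(1.0, k * p)


if __name__ == "__main__":
    p = 0.1  # hypothetical per-sample success probability
    for k in (1, 2, 4, 8):
        print(k, round(pass_at_k_iid(p, k), 4), round(pass_at_k_union_bound(p, k), 4))
```

The gap between the two values is the room available to a correlated sampler: it can raise pass@k above the i.i.d. value without changing the distribution of any individual sample.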

These gains result from higher coverage, which yields a stronger learning signal per batch.