Spec-AUF introduces a training method for masked block drafters in speculative decoding that aligns the training objective with inference behavior by restricting cross-entropy loss to the accepted prefix. The approach approximates prefix-sensitive supervision by keeping the loss support only through the drafter's first predicted failure, without requiring auxiliary objectives or changes to the inference pipeline.
- On Qwen3-8B, AUF raises the DFlash drafter's average emitted length from 2.40 to 2.61 across six benchmarks.
- The method transfers to Domino's two-branch head, where the average emitted length rises from 2.56 to 2.68.
- Standard exponential position-decay weighting becomes empirically inert once AUF truncates the support.
The change is meant to improve draft acceptance: the drafter is supervised only on positions up to its first predicted failure, which are the positions that can influence what is actually committed during generation.
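
The truncation described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes that a "predicted failure" is the first block position where the drafter's greedy token differs from the target token, and that this position itself stays inside the loss support. The function names `auf_mask` and `masked_cross_entropy` are hypothetical.

```python
import math
from typing import List, Sequence


def auf_mask(predicted: Sequence[int], target: Sequence[int]) -> List[int]:
    """Build a 0/1 loss mask over one drafted block.

    Assumption: positions are kept up to and including the first
    mismatch between the drafter's greedy token and the target token;
    every later position is dropped from the loss.
    """
    mask = []
    failed = False
    for p, t in zip(predicted, target):
        mask.append(0 if failed else 1)
        if p != t:
            failed = True  # later positions fall outside the support
    return mask


def masked_cross_entropy(
    log_probs: Sequence[Sequence[float]],
    target: Sequence[int],
    mask: Sequence[int],
) -> float:
    """Mean negative log-likelihood over the kept positions only.

    log_probs[i][v] is the drafter's log-probability of token v at
    block position i.
    """
    total = sum(-lp[t] * m for lp, t, m in zip(log_probs, target, mask))
    kept = sum(mask)
    return total / kept if kept else 0.0


# Drafter's greedy tokens vs. target tokens for one 4-token block.
predicted = [5, 7, 2, 9]
target = [5, 7, 3, 9]
mask = auf_mask(predicted, target)  # [1, 1, 1, 0]
```

In this example the drafter first fails at the third position, so the fourth position contributes nothing to the loss even though its greedy token happens to match. Averaging over kept positions rather than the full block is a design choice of this sketch, not something the summary specifies.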