The authors propose OLIVE, a self-supervised speech representation learning framework that jointly optimizes analysis and synthesis objectives through view-augmented masked latent prediction and waveform reconstruction. This unified approach constrains early encoder features to retain signal-level information while shaping later contextual representations toward invariance for robust downstream performance.

  • Combines view-augmented masked latent prediction with waveform reconstruction under a single objective.
  • Uses reconstruction to constrain early encoder features to retain signal-level information.
  • Shapes later contextual representations toward invariance via masked latent prediction.
  • Improves results on generation and speaker tasks while maintaining competitive performance on recognition and semantic tasks.
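To make the "single objective" idea concrete, here is a minimal sketch of how a masked latent prediction term and a waveform reconstruction term could be summed into one loss. It is not the authors' implementation. The summary does not specify the loss functions or how the terms are weighted, so the mean squared error on masked frames, the mean absolute error on the waveform, the `recon_weight` parameter, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def masked_prediction_loss(pred, target, mask):
    """Mean squared error computed over masked frames only.

    pred, target: arrays of shape (frames, dim). In a view-augmented setup,
    pred would presumably come from an augmented view and target from a
    reference latent; that pairing is an assumption, not taken from the source.
    mask: boolean array of shape (frames,), True where a frame was masked.
    """
    if not mask.any():
        return 0.0
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))


def reconstruction_loss(wave_hat, wave):
    """Mean absolute error between reconstructed and reference waveforms.

    The L1 distance is an illustrative choice; the source does not name
    the reconstruction criterion.
    """
    return float(np.mean(np.abs(wave_hat - wave)))


def joint_loss(pred, target, mask, wave_hat, wave, recon_weight=1.0):
    """Weighted sum of the two terms (recon_weight is a hypothetical knob)."""
    return (masked_prediction_loss(pred, target, mask)
            + recon_weight * reconstruction_loss(wave_hat, wave))


# Toy usage with made-up shapes: 4 latent frames of width 2, 8 audio samples.
pred = np.zeros((4, 2))
target = np.ones((4, 2))
mask = np.array([True, False, True, False])
wave_hat = np.zeros(8)
wave = np.full(8, 0.5)
total = joint_loss(pred, target, mask, wave_hat, wave, recon_weight=2.0)
```

In this sketch the prediction term only sees masked frames, while the reconstruction term sees every sample, which mirrors the split described above: one term shapes the contextual representations and the other keeps signal-level detail recoverable.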

The resulting representations support a broad range of tasks: OLIVE improves waveform reconstruction quality and also performs better on generation and speaker identification.