OLIVE: View-Augmented Latent Prediction with Waveform Reconstruction for Speech SSL

The authors propose OLIVE, a self-supervised speech representation learning framework that jointly optimizes analysis and synthesis objectives through view-augmented masked latent prediction and waveform reconstruction. This unified approach constrains early encoder features to retain signal-level information while shaping later contextual representations toward invariance for robust downstream performance.

Combines view-augmented masked latent prediction with waveform reconstruction under a single objective.
Uses reconstruction to constrain early encoder features to retain signal-level information.
Shapes later contextual representations toward invariance via masked latent prediction.
Improves results on generation and speaker tasks while maintaining competitive performance on recognition and semantic tasks.
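
The joint objective described in the points above can be sketched as a weighted sum of two terms. This is a minimal illustration, not the authors' implementation: the summary does not state the loss forms or their weighting, so the mean-squared error over masked frames, the L1 waveform loss, and the names masked_latent_loss, waveform_l1_loss, joint_loss, and recon_weight are all assumptions. The predicted and target latents are passed in precomputed; how the augmented views and targets are produced is not covered here.

```python
from typing import Sequence


def masked_latent_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    mask: Sequence[bool],
) -> float:
    """Mean squared error between predicted and target latents,
    averaged over masked frames only (assumed loss form)."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:  # only masked frames contribute to the prediction term
            total += sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
            count += 1
    return total / count if count else 0.0


def waveform_l1_loss(recon: Sequence[float], wave: Sequence[float]) -> float:
    """Mean absolute error between reconstructed and reference samples
    (assumed loss form)."""
    return sum(abs(r - w) for r, w in zip(recon, wave)) / len(wave)


def joint_loss(
    pred: Sequence[Sequence[float]],
    target: Sequence[Sequence[float]],
    mask: Sequence[bool],
    recon: Sequence[float],
    wave: Sequence[float],
    recon_weight: float = 1.0,  # hypothetical balancing weight
) -> float:
    """Single objective: latent prediction term plus weighted
    waveform reconstruction term."""
    prediction_term = masked_latent_loss(pred, target, mask)
    reconstruction_term = waveform_l1_loss(recon, wave)
    return prediction_term + recon_weight * reconstruction_term


# Toy usage: two frames of 2-dim latents, only the first frame is masked.
pred = [[1.0, 1.0], [0.0, 0.0]]
target = [[0.0, 0.0], [5.0, 5.0]]
mask = [True, False]
recon = [0.5, -0.5]
wave = [0.0, 0.0]
loss = joint_loss(pred, target, mask, recon, wave, recon_weight=2.0)
```

In this sketch the unmasked frame is ignored by the prediction term, so the large mismatch on the second frame does not affect the loss, while the reconstruction term covers every waveform sample.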

OLIVE yields representations that support a broad range of tasks, improving waveform reconstruction quality along with performance on generation and speaker identification.