The authors propose OLIVE, a self-supervised speech representation learning framework that jointly optimizes analysis and synthesis objectives through view-augmented masked latent prediction and waveform reconstruction. This unified approach constrains early encoder features to retain signal-level information while shaping later contextual representations toward invariance for robust downstream performance.
- Combines view-augmented masked latent prediction with waveform reconstruction under a single objective.
- Uses reconstruction to constrain early encoder features to retain signal-level information.
- Shapes later contextual representations toward invariance via masked latent prediction.
- Improves results on generation and speaker tasks while maintaining competitive performance on recognition and semantic tasks.
The resulting representations support a broad range of tasks, with improved waveform reconstruction quality and stronger performance on generation and speaker identification.
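To make the idea of a single joint objective concrete, the following is a minimal sketch of how a masked latent prediction term and a waveform reconstruction term might be combined. It is not the authors' implementation: the summary does not state the loss forms or their weighting, so the mean squared error on masked frames, the mean absolute error on waveform samples, the `recon_weight` parameter, and all function names are assumptions chosen for illustration.

```python
from typing import Sequence


def masked_prediction_loss(
    predicted: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    mask: Sequence[bool],
) -> float:
    """Mean squared error over masked frames only (assumed loss form)."""
    total, count = 0.0, 0
    for pred_frame, target_frame, is_masked in zip(predicted, targets, mask):
        if not is_masked:
            continue  # unmasked frames do not contribute to the prediction loss
        for p, t in zip(pred_frame, target_frame):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


def reconstruction_loss(decoded: Sequence[float], waveform: Sequence[float]) -> float:
    """Mean absolute error between decoded and original samples (assumed loss form)."""
    if len(decoded) != len(waveform):
        raise ValueError("decoded and original waveforms must have equal length")
    if not waveform:
        return 0.0
    return sum(abs(d - w) for d, w in zip(decoded, waveform)) / len(waveform)


def joint_loss(
    predicted: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
    mask: Sequence[bool],
    decoded: Sequence[float],
    waveform: Sequence[float],
    recon_weight: float = 1.0,  # hypothetical weighting, not given in the source
) -> float:
    """Weighted sum of the two terms, forming one scalar training objective."""
    prediction_term = masked_prediction_loss(predicted, targets, mask)
    reconstruction_term = reconstruction_loss(decoded, waveform)
    return prediction_term + recon_weight * reconstruction_term


# Toy example: three latent frames of dimension 2, frames 0 and 2 masked.
predicted = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
targets = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
mask = [True, False, True]

# Toy waveform of four samples and a hypothetical decoder output.
waveform = [0.0, 0.0, 0.0, 0.0]
decoded = [0.0, 0.5, -0.5, 1.0]

loss = joint_loss(predicted, targets, mask, decoded, waveform, recon_weight=2.0)
print(loss)
```

In this sketch the reconstruction term is computed directly on waveform samples while the prediction term is restricted to masked latent frames, mirroring the division of roles described above: the former pushes features to retain signal-level detail, the latter shapes contextual representations. How the two terms attach to early versus later encoder layers is outside what this toy example models.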