wav2VOT: Automatic estimation of voice onset time, closure duration, and burst realisation with wav2vec2

This article introduces wav2VOT, a tool built on the wav2vec2 model that automatically estimates voice onset time, closure duration, and burst realisation. It addresses the need for accurate speech annotation tools in phonetic research by showing how large speech models can be applied to these tasks.

wav2VOT performs comparably to current approaches on unseen datasets.
When fine-tuned, wav2VOT estimates these features with high accuracy.
Analysis shows high fidelity across stop voicing and place of articulation.

These results demonstrate that large speech models are capable of producing accurate annotations, motivating their further exploration as tools in phonetic research pipelines.
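
To make the three measures concrete, the following minimal sketch shows one way per-frame predictions could be turned into closure duration, voice onset time, and burst realisation. It is an illustration under stated assumptions, not the wav2VOT implementation: the label names (`closure`, `release`, `other`), the function `stop_measures`, and the decoding rule are hypothetical. The only external fact relied on is that wav2vec2 produces roughly one frame per 20 ms of 16 kHz audio. The sketch covers positive VOT only and does not handle prevoicing.

```python
from itertools import groupby

# wav2vec2 emits one feature frame per 20 ms of 16 kHz audio.
FRAME_SEC = 0.02


def stop_measures(frame_labels, frame_sec=FRAME_SEC):
    """Derive stop measures from a sequence of per-frame labels.

    Hypothetical label set (an assumption for illustration):
      'closure' - frames inside the stop closure
      'release' - frames from burst onset to voicing onset
      'other'   - everything else
    """
    # Collapse consecutive identical labels into (label, n_frames) runs.
    runs = [(label, sum(1 for _ in group)) for label, group in groupby(frame_labels)]

    # Take the first run of each type; 0 frames if the type never occurs.
    closure_frames = next((n for label, n in runs if label == "closure"), 0)
    release_frames = next((n for label, n in runs if label == "release"), 0)

    return {
        "closure_duration": closure_frames * frame_sec,
        "vot": release_frames * frame_sec,
        "burst_realised": release_frames > 0,
    }


if __name__ == "__main__":
    labels = ["other"] * 3 + ["closure"] * 4 + ["release"] * 2 + ["other"] * 5
    print(stop_measures(labels))
```

With a 20 ms frame, durations derived this way are quantised to 20 ms steps, so a practical system would likely need finer-grained boundary estimates than this sketch provides.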