The authors introduce the Style Text Embedding Benchmark (STEB), a comprehensive open-source benchmark designed to standardize the evaluation of style embeddings, which have previously been assessed using fragmented and inconsistent methods.
- STEB encompasses 96 datasets across 7 languages.
- The benchmark covers applications such as authorship verification, authorship retrieval, AI-text detection, and probing of linguistic features.
- Evaluation results show that semantic embeddings consistently fail on stylistic tasks.
- No single style embedding is universally superior across all evaluated tasks.
STEB thus provides a unified framework for assessing style embeddings and addresses the field's lack of standardized evaluation.
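To make one of the listed applications concrete, the following is a minimal sketch of how an authorship verification task could be scored from embeddings. It is not STEB's code or API: the summary does not state which metrics or interfaces the benchmark uses, so the choice of cosine similarity and ROC AUC, the function names (`cosine`, `roc_auc`, `verification_auc`), and the toy vectors are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def verification_auc(
    pairs: Sequence[Tuple[Vector, Vector]], labels: Sequence[int]
) -> float:
    """Score each text pair by embedding similarity, then compute AUC.

    labels[i] is 1 if pair i was written by the same author, else 0.
    """
    scores = [cosine(a, b) for a, b in pairs]
    return roc_auc(scores, labels)


# Hypothetical 2-dimensional "style embeddings", invented for this example.
pairs = [
    ([1.0, 0.0], [0.9, 0.1]),  # same author
    ([0.0, 1.0], [0.1, 0.9]),  # same author
    ([1.0, 0.0], [0.0, 1.0]),  # different authors
    ([0.8, 0.2], [0.2, 0.8]),  # different authors
]
labels = [1, 1, 0, 0]

auc = verification_auc(pairs, labels)
print(f"verification AUC: {auc:.2f}")
```

Under this setup, an embedding that captures style would place same-author pairs closer together than different-author pairs and score near 1.0, while an embedding that ignores style would score near 0.5.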