The authors introduce the Style Text Embedding Benchmark (STEB), an open-source benchmark that standardizes the evaluation of style embeddings, which have so far been assessed with fragmented and inconsistent methods.

  • STEB encompasses 96 datasets across 7 languages.
  • The benchmark covers applications such as authorship verification, authorship retrieval, AI-text detection, and probing of linguistic features.
  • Evaluation results show that semantic embeddings consistently fail on stylistic tasks.
  • No single style embedding is universally superior across all evaluated tasks.

By gathering these tasks under a single framework, STEB addresses the field's lack of standardized evaluation for style embeddings.
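
To make one of the listed tasks concrete, the following is a minimal sketch of how authorship verification could be scored with any embedding function: each pair of texts is embedded, the pair's cosine similarity serves as a same-author score, and the scores are summarized by ROC AUC. Everything here is an illustrative assumption rather than STEB's actual code or protocol: the names `toy_style_embedding`, `verification_auc`, and `FEATURES` are hypothetical, the feature list is an arbitrary stand-in for a real style embedding, and the choice of ROC AUC as the metric is not taken from the summary above.

```python
import math
from collections import Counter

# Hypothetical stand-in for a style embedding: relative frequencies of a few
# function words and punctuation marks. Not a model evaluated in STEB.
FEATURES = ["the", "of", "and", "to", "i", ",", ".", "!", ";"]


def toy_style_embedding(text):
    """Map a text to a small vector of function-word and punctuation rates."""
    spaced = text.lower()
    for mark in ",.!;":
        spaced = spaced.replace(mark, f" {mark} ")
    tokens = spaced.split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[f] / total for f in FEATURES]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def roc_auc(scores, labels):
    """Probability that a positive pair outscores a negative pair (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def verification_auc(pairs, labels, embed):
    """Score each (text_a, text_b) pair by embedding similarity, then compute AUC.

    labels: 1 if both texts share an author, 0 otherwise.
    embed:  any callable mapping a text to a vector.
    """
    scores = [cosine(embed(a), embed(b)) for a, b in pairs]
    return roc_auc(scores, labels)
```

Because `embed` is passed in as a parameter, the same loop can score a semantic embedding and a style embedding on identical pairs, which is the kind of like-for-like comparison a shared benchmark makes possible.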