This article introduces BERTomelo, a monolingual encoder for Portuguese built on the ModernBERT architecture.

  • Utilizes ModernBERT with Base and Large versions featuring a 1,024-token context window.
  • Incorporates efficiency optimizations from ModernBERT, including FlashAttention and alternating global and local attention layers.
  • Trained on ClassiCC-PT, a corpus of 106 million high-quality Portuguese documents.
  • Outperforms previous Portuguese encoders like BERTimbau and Albertina in scalability and efficiency.
  • Holds up well against much larger multilingual models on downstream tasks such as semantic textual similarity (STS) and named entity recognition (NER).
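
To make the alternating attention idea in the list above more concrete, here is a minimal pure-Python sketch. It labels each layer as global or local and builds a sliding-window mask for the local layers. The function names, the "one global layer in every three" ratio, and the window size are illustrative assumptions, not BERTomelo's published configuration.

```python
from typing import List


def layer_attention_kinds(num_layers: int, global_every: int = 3) -> List[str]:
    """Label each layer 'global' or 'local'.

    Assumption for illustration: layer 0 and every `global_every`-th layer
    after it attend globally; all other layers use a local window.
    """
    return ["global" if i % global_every == 0 else "local"
            for i in range(num_layers)]


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Build a local attention mask.

    Token i may attend to token j only when |i - j| <= window // 2.
    """
    half = window // 2
    return [[abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


# Toy sizes chosen for readability, far smaller than a real encoder.
kinds = layer_attention_kinds(num_layers=6)
local = sliding_window_mask(seq_len=8, window=4)
full = [[True] * 8 for _ in range(8)]

print(kinds)
print("local pairs:", attended_pairs(local))
print("global pairs:", attended_pairs(full))
```

Local layers score far fewer token pairs than global ones, and the gap widens with sequence length, which is where the efficiency gain of this layout comes from.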