This article introduces BERTomelo, a next-generation monolingual encoder specifically optimized for the Portuguese language using the ModernBERT architecture.
- Builds on ModernBERT and comes in Base and Large versions, each with a 1,024-token context window.
- Incorporates efficiency optimizations, including FlashAttention and alternating global and local attention layers.
- Trained on ClassiCC-PT, a corpus of 106 million high-quality Portuguese documents.
- Outperforms previous Portuguese encoders such as BERTimbau and Albertina in scalability and efficiency.
- Delivers robust performance on downstream tasks such as semantic textual similarity (STS) and named entity recognition (NER) when compared with massive multilingual models.
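
To make the alternating attention idea concrete, here is a minimal pure-Python sketch of how a ModernBERT-style encoder can interleave global layers (every token attends to every token) with local layers (each token attends only to a sliding window of neighbours). The function names, the `global_every=3` schedule, and the window sizes are illustrative assumptions; the summary above does not state the exact settings BERTomelo uses.

```python
def layer_schedule(num_layers, global_every=3):
    """Label each layer 'global' or 'local'.

    Assumption: one global layer every `global_every` layers, starting at
    layer 0. The real schedule of BERTomelo is not given in the summary.
    """
    return ["global" if i % global_every == 0 else "local"
            for i in range(num_layers)]


def attention_mask(seq_len, kind, window=128):
    """Bidirectional mask: mask[i][j] is True when token i may attend to token j.

    Global layers allow every pair. Local layers allow only pairs whose
    positions differ by at most window // 2 (a symmetric sliding window).
    """
    half = window // 2
    return [[kind == "global" or abs(i - j) <= half for j in range(seq_len)]
            for i in range(seq_len)]


def attended_pairs(mask):
    """Count allowed (query, key) pairs, a rough proxy for attention cost."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # Small toy sizes so the script runs instantly.
    for layer, kind in enumerate(layer_schedule(6)):
        mask = attention_mask(seq_len=16, kind=kind, window=4)
        print(layer, kind, attended_pairs(mask))
```

The local layers touch far fewer query-key pairs than the global ones, which is where the efficiency gain comes from: cost in local layers grows linearly with sequence length instead of quadratically, while the periodic global layers still let information flow across the whole input.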