BamiBERT: A New BERT-based Language Model for Vietnamese

Researchers introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese designed to address limitations of the current standard, PhoBERT. Trained from scratch on a 129GB corpus for 20 epochs, it supports an extended context length of up to 2048 tokens and operates directly on raw input without external word segmentation.

BamiBERT achieves the best score on 11 of 15 metrics across 8 Vietnamese benchmarks.
It sets a new state of the art among "base"-sized Vietnamese encoders.
It demonstrates strong cross-domain generalization.
It needs no external word segmentation because it operates directly on raw input.

The model is released at https://huggingface.co/Qualcomm-AI-Research/BamiBERT as an alternative to PhoBERT for Vietnamese text encoding tasks.
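
For readers who want to try the release, the following is a minimal sketch of how raw Vietnamese text might be encoded with the checkpoint. It rests on assumptions the summary does not confirm: that the repository loads through the generic `AutoTokenizer` and `AutoModel` classes of the Hugging Face `transformers` library, and that mean pooling over token embeddings is an acceptable sentence representation. The `encode` helper and the pooling choice are illustrative, not the authors' method; only the repository path and the 2048-token limit come from the text above.

```python
RELEASE_URL = "https://huggingface.co/Qualcomm-AI-Research/BamiBERT"
# Hub repository id, derived from the release URL given in the summary.
MODEL_ID = RELEASE_URL.split("huggingface.co/", 1)[1]
# Maximum context length stated in the summary.
MAX_TOKENS = 2048


def encode(texts, model_id=MODEL_ID, max_length=MAX_TOKENS):
    """Return one mean-pooled vector per input text (hypothetical helper).

    Assumption: the checkpoint is loadable via AutoTokenizer / AutoModel.
    """
    # Imported lazily so the module can be read without torch installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # Raw text goes in directly: no word-segmentation step beforehand.
    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)

    # Mean-pool over real tokens only, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

Calling `encode(["Hà Nội là thủ đô của Việt Nam."])` would download the checkpoint on first use and return a tensor with one row per input sentence. With PhoBERT, by contrast, the input is expected to be word-segmented before tokenization.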