Researchers introduce BamiBERT, a new BERT-based pre-trained language model for Vietnamese designed to address limitations of the current standard, PhoBERT. Trained from scratch on a 129GB corpus for 20 epochs, it supports an extended context length of up to 2048 tokens and operates directly on raw input without external word segmentation.

  • Achieves the best score on 11 of 15 metrics across 8 Vietnamese benchmarks.
  • Sets a new state of the art among "base"-sized Vietnamese encoders.
  • Demonstrates strong cross-domain generalization capabilities.
  • Eliminates the need for external word segmentation by operating directly on raw input.

The model is released at https://huggingface.co/Qualcomm-AI-Research/BamiBERT, offering a robust alternative for Vietnamese text encoding tasks.