ROMEVA addresses sub-word fragmentation in Roman Urdu by combining sub-word-average initialization with a PCA-guided anchor loss to stabilize embeddings. ROMEVA preserves the pretrained embeddings best, but naive fine-tuning achieves higher sentiment classification performance, which indicates a trade-off between embedding stability and downstream performance in morphologically inconsistent languages.
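As a rough illustration of the sub-word-average initialization mentioned above, the sketch below sets the embedding of a newly added token to the mean of the embeddings of the pieces it previously fragmented into. This is the generic form of the technique, not necessarily ROMEVA's exact procedure: the greedy `toy_tokenize` helper, the function name `subword_average_init`, and the two-dimensional embedding values are all hypothetical stand-ins chosen for the example.

```python
from typing import Dict, List

Vector = List[float]


def toy_tokenize(word: str, vocab: Dict[str, Vector]) -> List[str]:
    """Greedy longest-match segmentation (hypothetical stand-in for a real tokenizer)."""
    pieces: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return pieces


def subword_average_init(word: str, embeddings: Dict[str, Vector]) -> Vector:
    """Initialize a new token's vector as the mean of its sub-word vectors."""
    if not word:
        raise ValueError("word must be non-empty")
    pieces = toy_tokenize(word, embeddings)
    dim = len(embeddings[pieces[0]])
    return [sum(embeddings[p][d] for p in pieces) / len(pieces) for d in range(dim)]


# Made-up pretrained embeddings for three existing sub-word pieces.
embeddings: Dict[str, Vector] = {
    "bo": [1.0, 0.0],
    "hat": [0.0, 1.0],
    "b": [4.0, 4.0],
}

# "bohat" (Roman Urdu for "very") fragments into "bo" + "hat" under the toy vocabulary.
new_vec = subword_average_init("bohat", embeddings)
embeddings["bohat"] = new_vec  # the word now has a single, dedicated embedding
```

Starting the new token at the average of its former pieces keeps it inside the region of the embedding space the pretrained model already uses, which is the usual motivation for this initialization over a random one.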
ROMEVA: Geometry-Preserving Vocabulary Expansion for Roman Urdu Language Models from English