Large Language Models Fail to Translate Fongbe Accurately

Evaluations show that Fongbe translations are of poor quality (1.0-2.2/5), whereas Hausa translations reach acceptable scores (4.0-4.5/5); BLEU shows a consistent 3x gap between the two languages. Automatic metrics such as BERTScore exhibit embedding collapse and correlate weakly with human judgments, especially for Hausa. In human judgments, Gemini outperforms the other models for Fongbe, and GPT-4o does so for Hausa. At least 2,500 sentences are needed for model rankings to be stable.
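
The summary does not say how the sample-size threshold was derived. One common way to probe ranking stability is to resample the evaluation set at different sizes and check how often the model ordering matches the ordering on the full set. The following is a minimal sketch of that idea under stated assumptions: the function names, the three system names, and the per-sentence scores are all hypothetical, and the scores are synthetic rather than taken from the evaluation.

```python
import random


def rank_models(scores, indices):
    """Order model names by mean score over the given sentence indices, best first."""
    means = {
        name: sum(per_sentence[i] for i in indices) / len(indices)
        for name, per_sentence in scores.items()
    }
    return tuple(sorted(means, key=means.get, reverse=True))


def ranking_stability(scores, sample_size, n_resamples=100, seed=0):
    """Fraction of resamples (drawn with replacement) whose model ranking
    matches the ranking computed on the full evaluation set."""
    rng = random.Random(seed)
    n_sentences = len(next(iter(scores.values())))
    reference = rank_models(scores, range(n_sentences))
    matches = 0
    for _ in range(n_resamples):
        sample = [rng.randrange(n_sentences) for _ in range(sample_size)]
        if rank_models(scores, sample) == reference:
            matches += 1
    return matches / n_resamples


# Synthetic per-sentence scores for three hypothetical systems.
# These numbers are illustrative only and are NOT the evaluation's data.
_rng = random.Random(42)
scores = {
    "system_a": [_rng.gauss(2.8, 1.0) for _ in range(5000)],
    "system_b": [_rng.gauss(3.1, 1.0) for _ in range(5000)],
    "system_c": [_rng.gauss(3.4, 1.0) for _ in range(5000)],
}

for size in (20, 100, 500, 2500):
    print(size, ranking_stability(scores, size))
```

With real per-sentence scores in place of the synthetic ones, the smallest sample size at which the stability fraction stays near 1.0 gives a rough estimate of how many sentences a ranking needs. The size at which this happens depends on how close the systems are to one another, so the sizes printed here say nothing about the 2,500 figure itself.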