CANDLE is a lightweight system that uses Connectionist Temporal Classification (CTC) to remove noisy character repetitions from Arabic text without relying on handcrafted rules or morphological analyzers. It achieves a Sentence Error Rate of 5.37% and reduces tokenizer fertility by up to 12.8%, which lowers inference costs and makes better use of the context window.
CANDLE: Lightweight Arabic Noise Deduplication via CTC
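To make the two technical ideas in the summary concrete, the sketch below shows the standard CTC greedy collapse rule (merge consecutive identical labels, then drop blanks) and one common way to measure tokenizer fertility (tokens per whitespace-separated word). It is an illustration only, not CANDLE's implementation: the blank symbol `_`, the function names, and the `tokenize` callable are assumptions introduced here, and the summary does not say exactly how fertility is computed.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical blank symbol for illustration; real CTC models reserve a class index.
BLANK = "_"


def ctc_greedy_collapse(frame_labels: Sequence[str], blank: str = BLANK) -> str:
    """Apply the standard CTC collapse rule to per-frame argmax labels."""
    out: List[str] = []
    prev: Optional[str] = None
    for label in frame_labels:
        # Keep a label only if it differs from the previous frame and is not blank.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word (assumed definition)."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for the example: splits each word into 3-character pieces."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]
```

With these definitions, `ctc_greedy_collapse(list("hh_e_ll_l_oo"))` returns `"hello"`: repeated frames merge, while the blank between the two `l` runs preserves the legitimate double letter. The same property is what would let a CTC-based model keep genuine doubled characters while dropping elongation noise. With the toy tokenizer, a noisy input such as `"coooooool day"` has a higher fertility than `"cool day"`, which illustrates why removing repetitions can shrink token counts.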