This paper presents a framework for translating Marathi government documents into English while preserving layout and structure, addressing a limitation of existing systems that neglect formatting. The system integrates layout-aware OCR, coordinate-based text extraction, large language model (LLM) translation, and HTML reconstruction to ensure spatial alignment and hierarchical consistency.
- Integrates layout-aware optical character recognition (OCR) with coordinate-based text extraction, so each text segment is recovered together with its position on the page.
- Utilizes large language models for translation while enforcing spatial alignment constraints.
- Reconstructs documents through HTML representations to preserve hierarchical elements and layout.
- Demonstrates improved structural preservation, translation coherence, and terminological consistency over conventional pipelines on real-world Marathi government PDFs.
The framework contributes toward scalable multilingual accessibility solutions for e-governance and administrative document processing by enabling end-to-end document transformation.
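
As a rough illustration of the coordinate-based HTML reconstruction step, the sketch below places each translated text block at the position reported for its source block. It is not the paper's implementation: the `TextBlock` schema, the `render_page` function, and the `translate` callable (a stand-in for the LLM translation stage) are all hypothetical names introduced here, and absolute CSS positioning is only one possible way to keep spatial alignment.

```python
from dataclasses import dataclass
from html import escape
from typing import Callable, Iterable


@dataclass(frozen=True)
class TextBlock:
    """One OCR text region (hypothetical schema); coordinates are in pixels."""
    text: str
    x: float
    y: float
    width: float
    height: float


def render_page(blocks: Iterable[TextBlock],
                translate: Callable[[str], str],
                page_width: float,
                page_height: float) -> str:
    """Translate each block and emit HTML that keeps its original position."""
    parts = [
        '<div class="page" style="position:relative;'
        f'width:{page_width}px;height:{page_height}px">'
    ]
    # Sort top-to-bottom, then left-to-right, so the markup order
    # approximates the reading order of the source page.
    for b in sorted(blocks, key=lambda b: (b.y, b.x)):
        parts.append(
            f'<div style="position:absolute;left:{b.x}px;top:{b.y}px;'
            f'width:{b.width}px;height:{b.height}px">'
            f'{escape(translate(b.text))}</div>'
        )
    parts.append('</div>')
    return "\n".join(parts)
```

In use, `translate` would wrap whatever translation model is available; for a quick check it can be a dictionary lookup such as `{"दिनांक": "Date"}.get`. Escaping the translated text keeps characters like `&` or `<` in the model output from breaking the generated markup.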