Retrieval & RAG
media r/LocalLLaMA · 2d ago

Comparing Docling, Liteparse, MinerU, and Unstructured for On-Prem Document Processing

A university needs on-premises document processing for academic workflows because strict data governance policies ban cloud APIs, so only local parsers are an option. The poster evaluates Docling, Liteparse, MinerU, and Unstructured: Docling excels at complex layouts and is Apache 2.0 licensed but is slower; Liteparse performs well on printed documents using Tesseract OCR; MinerU uses PaddleOCR and handles French documents well despite a longer setup; Unstructured supports multiple formats, including DOCX and PPTX. The solution must give stable results when repeatedly parsing PDFs that are revised over time with only minor formatting changes.

lab Mistral AI News · 2d ago

Mistral Releases OCR 4 with Multilingual Support and Structured Output

Mistral OCR 4 introduces bounding boxes, block classification, and inline confidence scores for 170 languages across 10 language groups. It outperforms leading OCR systems in human preference evaluations with a 72% win rate and achieves the top score on OlmOCRBench (85.20), while offering self-hosted deployment in a single container and supporting enterprise use cases like RAG and document ingestion.

arxiv arXiv cs.LG · 6d ago

Train, Retrieve, or Both? Head-to-Head on Statutory Citation for Ontario RTA

A four-arm comparison shows that retrieval is essential for accurate statutory citation under the Ontario Residential Tenancies Act. The SFT+RAG hybrid model achieves an exact-match score of 0.481 with zero hallucinations, outperforming the base and SFT-only models, and matches a pipeline built on larger, specialized models without needing additional data or a larger training set. The results come from a small, human-verified, real-world evaluation set and are preliminary.
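
For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how an exact-match rate and a hallucination count could be computed. The summary does not say how the paper defines either one, so everything here is an assumption: the function names, the normalization step, the rule that a prediction counts as hallucinated when it is absent from a list of valid sections, and the toy citation strings are all hypothetical.

```python
def normalize(citation: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(citation.lower().split())


def score_citations(predictions, references, valid_sections):
    """Return (exact_match_rate, hallucination_count) for parallel lists of citations.

    Assumption: a prediction is "hallucinated" if it is not in valid_sections.
    The paper may use a different definition.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("need at least one example")
    valid = {normalize(s) for s in valid_sections}
    exact = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    hallucinated = sum(normalize(p) not in valid for p in predictions)
    return exact / len(predictions), hallucinated


# Hypothetical toy data, not taken from the paper.
valid_sections = ["s. 20(1)", "s. 48(1)", "s. 59(1)", "s. 83(1)"]
references = ["s. 20(1)", "s. 48(1)", "s. 59(1)", "s. 83(1)"]
predictions = ["s. 20(1)", "s. 59(1)", "S. 59(1)", "s. 999"]

em, hallucinated = score_citations(predictions, references, valid_sections)
print(f"exact match: {em:.3f}, hallucinated citations: {hallucinated}")
```

On this toy data, two of the four predictions match their references after normalization, and one prediction cites a section that is not in the valid list.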