Retrieval & RAG — korshunov.ai

Mistral Releases OCR 4 with Multilingual Support and Structured Output

Mistral OCR 4 adds bounding boxes, block classification, and inline confidence scores, and supports 170 languages across 10 language groups. It outperforms leading OCR systems in human preference evaluations with a 72% win rate and achieves the top score on OlmOCRBench (85.20). It can be self-hosted as a single container and targets enterprise use cases such as RAG and document ingestion.

arXiv cs.CL · 8d ago

ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents

ProvenanceGuard introduces a source-aware verifier for MCP-based LLM agents that detects cross-source conflation by routing claims to specific evidence sources and comparing stated attribution with actual source ownership. It achieves block F1 of 0.802 and source accuracy of 0.858 on 260 source-eligible claims, outperforming source-blind baselines, and detects all injected attribution swaps in 50 clinical probes.
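
The core attribution check can be pictured with a small sketch. It assumes each MCP server's returned evidence is available as a set of fact strings keyed by a source name. The source names, the exact-string matching, and the `check_attribution` helper are all hypothetical stand-ins, not ProvenanceGuard's actual claim routing or verifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str            # the factual statement the agent produced
    stated_source: str   # the source the agent attributes it to


def owning_source(claim_text: str, evidence: dict) -> Optional[str]:
    """Return the one source whose evidence contains the claim, else None."""
    owners = [src for src, facts in evidence.items() if claim_text in facts]
    return owners[0] if len(owners) == 1 else None


def check_attribution(claim: Claim, evidence: dict) -> str:
    owner = owning_source(claim.text, evidence)
    if owner is None:
        return "unverified"      # no unique owner, so the attribution cannot be confirmed
    if owner != claim.stated_source:
        return "misattributed"   # cross-source conflation, e.g. an attribution swap
    return "ok"


# Hypothetical evidence returned by two MCP servers in a clinical session
evidence = {
    "labs_server": {"HbA1c is 7.2%"},
    "ehr_server": {"patient is allergic to penicillin"},
}
print(check_attribution(Claim("HbA1c is 7.2%", "ehr_server"), evidence))  # prints misattributed
```

Returning "unverified" when no single source owns a claim keeps the sketch from guessing; a real verifier would need fuzzy or model-based matching rather than exact strings.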

Hugging Face Forums · 5h ago

Ontological Inversion: Flipping LLM Emotional Concepts via Negative Gain

The author introduces 'ontological inversion,' a technique intended to move LLMs beyond one-directional inference. When prompted with personal experiences, LLMs tend to overfit to a single emotional label; this method lets a model capture nuanced, multifaceted concepts, such as a memory that evokes sorrow and joy at the same time. The approach was developed by applying a negative gain factor during sweeps in the Niodoo steering architecture. By inverting a concept, analogous to an involution in physics, the technique can flip a model's emotional state, for example turning a sorrowful memory into a joyful one. The work is shared in the GitHub repository 'ontological-inversion' by user Ruffian-L.

arXiv cs.CL · 23h ago

MMed-Bench-IR: A Multilingual Medical Retrieval Benchmark

MMed-Bench-IR is a heterogeneous benchmark for multilingual medical information retrieval across six languages. It evaluates cross-lingual alignment, concept discrimination, and evidence retrieval through three distinct tasks with no overlapping concepts or queries. Evaluations show large cross-lingual performance drops: English biomedical encoders fall from 0.818 to 0.056 nDCG@10 when moving to Japanese, a limitation that English-only benchmarks do not detect.
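
For readers unfamiliar with the headline metric, nDCG@10 divides the discounted gain of the top 10 results by that of an ideal ordering. A minimal sketch using linear gains (some toolkits use 2^rel − 1 instead); the function names are illustrative, and this is not the benchmark's own evaluation code.

```python
import math


def dcg_at_k(relevances, k=10):
    # Rank r (1-based) is discounted by log2(r + 1); enumerate gives i = r - 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_rels, judged_rels, k=10):
    """ranked_rels: graded relevance of the retrieved docs, in rank order.
    judged_rels: relevance of every judged doc for the query, used for the ideal ranking."""
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


# One relevant passage retrieved at rank 3 instead of rank 1
print(round(ndcg_at_k([0, 0, 1, 0], [1]), 3))  # prints 0.5
```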

arXiv cs.CL · 1d ago

Privacy-Preserving RAG via Multi-Agent Semantic Rewriting

A multi-agent framework sanitizes retrieved content by semantically rewriting it to remove sensitive identifiers, reducing privacy leakage under targeted attacks. It preserves slightly more contextual fidelity than SAGE (BLEU-1 of 0.122 vs. 0.117) and runs as an asynchronous preprocessing step, adding no latency to online inference.

Hugging Face Forums · 1d ago

Spaces tokens stop working after update

Users report that Spaces tokens stopped working after a recent update and that generated files are no longer saved, disrupting workflows and model execution.

r/LocalLLaMA · 1d ago

LLM Medical Scribing Benchmark: Omissions Outnumber Hallucinations

A benchmark of 8 LLMs on 300 synthetic doctor-patient dialogues found 520 clinically relevant omissions but only 12 high-impact hallucinations. DeepSeek led on prose quality and cost but missed many safety-relevant facts, while Claude Opus had the fewest omissions but weaker prose.

arXiv cs.CL · 2d ago

ViRGo: Adaptive Routing for Visual Retrieval and Global Perception

ViRGo introduces a lightweight framework that adapts visual retrieval based on object scale. It uses intrinsic localization and semantic confidence to route between global perception, patch-based retrieval, and attention-based retrieval, improving accuracy-efficiency trade-offs without extra computation.

arXiv cs.CL · 2d ago

π-RAG: Oblivious Retrieval via Semantic Quantization and Transcendental Addressing

π-RAG decouples LLMs from sensitive data by using the digits of π as an immutable source of entropy. A semantic quantization layer maps user inputs to canonical intent centroids, and a cryptographic salt then generates deterministic offsets that point to standardized payloads. The design aims to provide oblivious retrieval with mathematical guarantees of data privacy.
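
A toy sketch of the two named steps, under loud assumptions: cosine nearest-centroid lookup stands in for the semantic quantization layer, HMAC-SHA256 stands in for the salted offset derivation, and the mapping from an offset to a position in π's digits and then to a payload is omitted. Centroid names, the salt, and `space` are hypothetical. It only illustrates determinism (paraphrased inputs that quantize to the same intent yield the same offset); it does not demonstrate any privacy guarantee.

```python
import hashlib
import hmac
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def quantize(embedding, centroids):
    """Snap an input embedding to the id of its most similar intent centroid."""
    return max(centroids, key=lambda cid: cosine(embedding, centroids[cid]))


def offset_for(intent_id, salt, space=10**9):
    """Derive a deterministic offset from the canonical intent and a secret salt (HMAC-SHA256)."""
    digest = hmac.new(salt, intent_id.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") % space


# Hypothetical 2-D intent space: two paraphrases land on the same centroid
centroids = {"reset_password": [1.0, 0.0], "billing_question": [0.0, 1.0]}
salt = b"deployment-secret"
off_1 = offset_for(quantize([0.9, 0.2], centroids), salt)
off_2 = offset_for(quantize([0.8, 0.1], centroids), salt)
print(off_1 == off_2)  # prints True
```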

arXiv cs.CL · 2d ago

Topic-to-Timestamp Alignment by Constrained Evidence Selection

A new method improves topic-to-timestamp alignment in meeting transcripts by selecting timestamped evidence instead of generating timecodes. On 420 queries from municipal meeting transcripts, it boosts Recall@5 to 50.0%, reduces MAE to 761.0 seconds, and increases parseable outputs from 373 to 419, showing that retrieval quality and output design are critical.
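
Both reported metrics are standard. A minimal sketch, assuming one gold segment per query and that unparseable outputs are excluded from MAE (the summary does not say how the paper treats them); segment ids and timestamps are hypothetical.

```python
import math


def recall_at_k(ranked_segment_ids, gold_segment_id, k=5):
    """1.0 if the gold transcript segment is among the top-k selected segments, else 0.0."""
    return 1.0 if gold_segment_id in ranked_segment_ids[:k] else 0.0


def mean_abs_error(predicted_seconds, gold_seconds):
    """MAE in seconds over parseable predictions; None marks an unparseable output."""
    errors = [abs(p - g) for p, g in zip(predicted_seconds, gold_seconds) if p is not None]
    return sum(errors) / len(errors) if errors else math.nan


# None stands for a generated timecode that could not be parsed
print(mean_abs_error([120.0, None, 300.0], [150.0, 40.0, 250.0]))  # prints 40.0
```

Selecting a timestamped segment, rather than generating a timecode, means every prediction is an existing start time, which is why parseability is a separate axis from accuracy here.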

arXiv cs.CL · 2d ago

PeerCheck: Improving LLM-Generated Academic Reviews

PeerCheck analyzes differences between LLM-generated and human academic reviews, finding that LLMs focus on theory while humans prioritize methodology and experiments. The framework applies prompting strategies such as Chain-of-Thought (CoT) and retrieval-augmented generation (RAG). CoT significantly improves review quality, whereas RAG introduces an unexpected 'paradox' in which it sometimes reduces quality.

arXiv cs.CL · 2d ago

The Token Tax of Epistemic Accuracy in Document-Grounded AI

A study compares retrieval-augmented generation (RAG) and long-context prompting in document-grounded AI. Long-context prompting achieves higher epistemic accuracy—73.1% vs. 65.4%—but at 26 times the per-query token cost, highlighting a significant token tax for broader evidentiary access.
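
The trade-off can be made concrete with the two reported accuracies and the 26-fold cost ratio. A back-of-the-envelope sketch that treats one RAG query's token count as the unit, since absolute token counts are not given in the summary; the `token_tax` helper is hypothetical.

```python
def token_tax(acc_cheap, acc_costly, cost_ratio):
    """Accuracies in percent; cost_ratio = tokens per query of the costly / cheap strategy."""
    abs_gain = acc_costly - acc_cheap               # percentage points gained
    rel_gain = acc_costly / acc_cheap - 1           # relative improvement
    extra_per_point = (cost_ratio - 1) / abs_gain   # extra tokens per point, in RAG-query units
    return abs_gain, rel_gain, extra_per_point


gain, rel, per_point = token_tax(acc_cheap=65.4, acc_costly=73.1, cost_ratio=26)
print(f"{gain:.1f} points, {rel:.1%} relative, {per_point:.2f} RAG queries of tokens per point")
# prints: 7.7 points, 11.8% relative, 3.25 RAG queries of tokens per point
```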

arXiv cs.CL · 2d ago

Ablation Study of Agentic RAG Components with Local 7B Model

A controlled ablation study evaluates agentic RAG components using a local 7B model on HotpotQA. Fixed hybrid retrieval outperforms adaptive routing by 1.8 EM and 1.9 F1 points, and two retrieval iterations capture 95% of the gains from five. Query decomposition and cross-encoder reranking yield statistically significant but smaller improvements.
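
The summary does not say how the study's hybrid retriever merges its result lists. Reciprocal rank fusion (RRF) is one common way to combine a sparse (e.g. BM25) and a dense ranking; the sketch below assumes RRF with the customary k = 60, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: each doc scores the sum of 1 / (k + rank) over lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


sparse = ["d1", "d2", "d3"]   # e.g. BM25 order
dense = ["d2", "d3", "d4"]    # e.g. embedding-similarity order
print(reciprocal_rank_fusion([sparse, dense]))  # prints ['d2', 'd3', 'd1', 'd4']
```

Documents that both retrievers rank reasonably well (d2, d3) rise above documents that only one retriever finds (d1, d4).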

arXiv cs.AI · 6d ago

AI Economist Agent: Model-Grounded Economic Analysis Framework

The AI Economist Agent uses RAG, knowledge graphs, and LLMs to generate economic narratives grounded in theory and data. It enables model-based analysis, evidence retrieval, and report generation, ensuring economic coherence and traceability through explicit model computations.

arXiv cs.LG · 6d ago

Train, Retrieve, or Both? Head-to-Head on Statutory Citation for Ontario RTA

A four-arm comparison shows that retrieval is essential for accurate statutory citation under the Ontario Residential Tenancies Act. The SFT+RAG hybrid achieves 0.481 exact match with zero hallucinations, outperforming the base and SFT-only models, and matches a pipeline built on larger, specialized models without requiring more training data. The results come from a small, human-verified, real-world evaluation set and are preliminary.

arXiv cs.CL · 6d ago

Multi-Agent Transactive Memory Framework

Multi-Agent Transactive Memory (MATM) enables population-level storage and retrieval of agent-generated trajectories. It allows producer agents to share procedural knowledge with consumer agents, improving task performance and reducing interaction steps in interactive environments like ALFWorld and WebArena without coordination or joint training.
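
A toy model of the producer/consumer pattern: one shared store that any agent writes trajectories into and any other agent queries by task description. Word-overlap (Jaccard) retrieval, the `TransactiveMemory` class, and the ALFWorld-style task strings are all assumptions for illustration, not MATM's actual storage or retrieval mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class TransactiveMemory:
    """Population-level store shared by all agents; agents never coordinate directly."""
    entries: list = field(default_factory=list)  # (task word set, trajectory) pairs

    def write(self, task, trajectory):
        """Producer side: record a trajectory (list of actions) for a task."""
        self.entries.append((set(task.lower().split()), trajectory))

    def retrieve(self, task, top_k=1):
        """Consumer side: return trajectories whose tasks overlap most with this one."""
        query = set(task.lower().split())

        def jaccard(words):
            union = words | query
            return len(words & query) / len(union) if union else 0.0

        ranked = sorted(self.entries, key=lambda entry: jaccard(entry[0]), reverse=True)
        return [trajectory for _, trajectory in ranked[:top_k]]


memory = TransactiveMemory()
memory.write("put a clean mug in the coffee machine",
             ["go to sink", "clean mug", "go to coffee machine", "put mug"])
memory.write("find the price of a laptop", ["open search", "type laptop", "read price"])
print(memory.retrieve("put a clean cup in the coffee machine")[0][0])  # prints go to sink
```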

arXiv cs.CL · 6d ago

Tool-Intent Stabilization in Streaming RAG

A study measures tool-intent stabilization in Streaming RAG, defining when speculative tool queries converge to correct answers. On the CRAG benchmark, 73.9% of queries allow substantial latency hiding, with early stabilization observed in questions with verbatim retrievable evidence. Question type significantly predicts early versus late stabilization, informing when speculative triggers are effective.
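
One way to operationalize stabilization is the earliest point in the streamed input after which the speculative tool query stops changing. The sketch assumes exact string equality with the full-input query as the convergence test; the study may use a different criterion (e.g. semantic equivalence or answer correctness), and the example queries are made up.

```python
def stabilization_index(speculative_queries):
    """Index of the earliest prefix from which the speculative query never changes again.

    speculative_queries[i] is the tool query predicted after i + 1 chunks of streamed input;
    the last entry corresponds to the full input.
    """
    final = speculative_queries[-1]
    idx = len(speculative_queries) - 1
    while idx > 0 and speculative_queries[idx - 1] == final:
        idx -= 1
    return idx


def hideable_fraction(speculative_queries):
    """Share of the stream still to arrive when the query stabilizes (room for latency hiding)."""
    n = len(speculative_queries)
    return (n - 1 - stabilization_index(speculative_queries)) / n


queries = ["weather", "weather paris", "weather paris", "weather paris"]
print(hideable_fraction(queries))  # prints 0.5
```

Because the scan walks backward from the final query, a transient early match that later changes (as in `["a", "b", "a"]`) does not count as stabilization.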

arXiv cs.CL · 6d ago

CATCH-ME if you RAG: Multilingual Counterspeech Dataset for Hate and Misinformation

CATCH-ME introduces the first large-scale, multilingual dataset of contextually annotated, multi-turn counterspeech dialogues targeting hate and misinformation. The dataset covers five languages and focuses on seven marginalized groups, with dialogues grounded in verified fact-checking sources and including document- and chunk-level span annotations for RAG systems.