arXiv cs.CL · 4h ago

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

The authors introduce Cortex, a framework that transforms web-scale corpus construction from flat document filtering into structured knowledge organization using an Ontological Corpus Graph (OCG). This three-layer structure unifies quality-refined content, a hierarchical lightweight ontology, and cross-domain alignment to address the escalating data requirements of large language models.

arXiv cs.CL · 4h ago

DAIN: Dynamic Agent-Based Interaction Network for Efficient and Collaborative Multimodal Reasoning

Researchers introduce the Dynamic Agent-based Interaction Network (DAIN), a framework that treats multimodal fusion as a dynamic, multi-agent collaborative process rather than as a static architecture. DAIN uses a context-aware Meta-Controller to schedule sparse activation of specialized agents and to orchestrate compressed communication for consensus-building.

arXiv cs.CL · 5h ago

Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning

The authors propose PRP, a Proactive Routing Paradigm that accelerates inference in large multimodal models by enabling early decision-making through joint evaluation of draft and target model competence. This approach addresses the bottleneck of establishing reliable query difficulty signals in multimodal settings without relying on data-sensitive supervised fine-tuning or post-hoc token probabilities.

arXiv cs.CL · 5h ago

When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding

This article develops a theory for speculative decoding regimes that use greedy decoding, relaxed acceptance rules, or tree-based candidate sets, rather than the stochastic distribution-preserving settings studied in existing literature. The authors characterize rejection regions as lower level sets of the target distribution to derive exact KL divergence requirements and sharp margin-based bounds for various acceptance criteria.
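
To make the acceptance regimes concrete, here is a minimal sketch in Python of two non-stochastic acceptance rules. It is an illustration under stated assumptions, not the paper's formulation: the function names, the threshold parameter `tau`, and the toy probabilities are all hypothetical. The threshold rule rejects a draft token exactly when it falls in a lower level set of the target distribution, {x : p(x) < tau}.

```python
from typing import Callable, Sequence

Probs = Sequence[float]  # target-model distribution over the vocabulary


def greedy_accept(target_probs: Probs, draft_token: int) -> bool:
    """Greedy rule: accept only if the draft token is the target argmax."""
    argmax = max(range(len(target_probs)), key=lambda i: target_probs[i])
    return argmax == draft_token


def threshold_accept(target_probs: Probs, draft_token: int, tau: float) -> bool:
    """Relaxed rule (hypothetical): accept when the target probability of the
    draft token is at least tau, i.e. reject on the set {x : p(x) < tau}."""
    return target_probs[draft_token] >= tau


def accepted_prefix_length(
    steps: Sequence[Probs],
    draft_tokens: Sequence[int],
    accept: Callable[[Probs, int], bool],
) -> int:
    """Count draft tokens accepted before the first rejection."""
    accepted = 0
    for probs, token in zip(steps, draft_tokens):
        if not accept(probs, token):
            break
        accepted += 1
    return accepted


# Toy target distributions for three draft positions (made-up numbers).
steps = [[0.6, 0.3, 0.1], [0.45, 0.4, 0.15], [0.1, 0.2, 0.7]]
draft = [0, 1, 2]

n_greedy = accepted_prefix_length(steps, draft, greedy_accept)
n_relaxed = accepted_prefix_length(
    steps, draft, lambda p, t: threshold_accept(p, t, tau=0.3)
)
```

In this toy case the greedy rule stops at the second position, where the draft token is a close runner-up rather than the argmax, while the threshold rule accepts all three tokens.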

arXiv cs.CL · 6h ago

Majority Vote Silences Minority Values: Annotator Disagreement at the Hate/Offensive Boundary in HateXplain

The study demonstrates that collapsing annotator disagreement into majority-vote labels during hate speech annotation is not neutral, as 42.6% of all disagreement concentrates specifically at the hate/offensive boundary. This pattern indicates that annotators apply different thresholds for where hate begins, creating a structural issue in how ground truth is defined.
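
The effect of majority voting can be sketched in a few lines of Python. This is a toy illustration, not the study's method: the label strings, the example annotations, and the definition of "boundary disagreement" (an item whose distinct labels are exactly hate and offensive) are assumptions made for the example.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority, or None if there is none."""
    top, count = Counter(labels).most_common(1)[0]
    return top if count > len(labels) / 2 else None


def boundary_share(
    annotations: Sequence[Sequence[str]],
    a: str = "hatespeech",
    b: str = "offensive",
) -> float:
    """Fraction of disagreeing items whose distinct labels are exactly {a, b}."""
    disagreeing = [set(labels) for labels in annotations if len(set(labels)) > 1]
    if not disagreeing:
        return 0.0
    return sum(s == {a, b} for s in disagreeing) / len(disagreeing)


# Hypothetical items, three annotators each (made-up data).
toy = [
    ["hatespeech", "hatespeech", "offensive"],  # disagreement at the boundary
    ["offensive", "normal", "offensive"],       # disagreement elsewhere
    ["normal", "normal", "normal"],             # full agreement
    ["hatespeech", "offensive", "offensive"],   # disagreement at the boundary
]

collapsed = [majority_label(labels) for labels in toy]
share = boundary_share(toy)
```

After collapsing, the first and last items carry a single label each, and the dissenting hate or offensive judgment is no longer visible in the data, even though two of the three disagreeing items sit on that boundary.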

arXiv cs.CL · 6h ago

Structure-Preserving Document Translation via Multi-Stage LLM Pipeline: A Case Study in Marathi

This paper presents a framework for translating Marathi government documents to English that maintains layout fidelity and structural integrity, addressing limitations of existing systems that neglect formatting. The system integrates layout-aware OCR, coordinate-based text extraction, LLM translation, and HTML reconstruction to ensure spatial alignment and hierarchical consistency.