Grounding LLM Reasoning under Incomplete Graph Evidence
This article presents a theoretical framework for grounding large language model reasoning trajectories when relying on incomplete knowledge graph evidence rather than complete truth states.
This article proposes a novel multi-agent system that emulates human annotator decision-making processes to detect and debunk disinformation, achieving superior results compared to individual Large Language Models like GPT-4 and GPT-3.5.
This article develops a theory for speculative decoding regimes that use greedy decoding, relaxed acceptance rules, or tree-based candidate sets, rather than the stochastic distribution-preserving settings studied in existing literature. The authors characterize rejection regions as lower level sets of the target distribution to derive exact KL divergence requirements and sharp margin-based bounds for various acceptance criteria.
Researchers present DialogPII, a multilingual dataset of synthetic dialog transcripts designed to support the development and evaluation of automatic systems for detecting personally identifiable information. This resource addresses privacy concerns in sensitive domains by providing annotated data across 11 languages and eight interaction scenarios.
The authors propose a novel training approach for end-to-end automatic speech recognition (ASR) that addresses noisy labels and lack of domain specificity in large-scale weakly supervised datasets. The method involves pretraining on the full dataset, continued pretraining on a filtered subset based on character error rate, and fine-tuning on acoustically similar samples from that subset.
A user tested Qwen3.6-27B (8-bit) alongside GLM5.2 using a coding harness that employs three critics—code review, test review, and Playwright e2e—to validate output quality.
This paper introduces DriftGuard, a framework that combines multi-monitor drift detection with selective model updating to address evolving toxicity in automated moderation systems. The system tracks specific safety-relevant shifts, such as identity-harm and toxic-risk drift, rather than relying solely on global distributional changes.
The authors introduce 5ting, a system designed for SemEval-2026 Task 8 (MTRAGEval), which evaluates multi-turn Retrieval Augmented Generation (RAG) systems. The system addresses challenges such as context drift, underspecification, and hallucination risk by combining dense retrieval with LLM-based reranking and faithfulness control.
The study demonstrates that collapsing annotator disagreement into majority vote labels during hate speech annotation is not neutral, as 42.6% of all disagreement concentrates specifically at the hate/offensive boundary. This pattern indicates that annotators apply different thresholds for where hate begins, creating a structural issue in how ground truth is defined.
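One way to see how disagreement can concentrate at a single label boundary is to count disagreeing annotator pairs by the two labels involved. The sketch below assumes pairwise disagreement as the unit of measurement, which may differ from the study's own definition; the label names and toy annotations are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def disagreement_shares(annotations: list[list[str]]) -> dict[tuple[str, str], float]:
    """Share of all disagreeing annotator pairs that falls on each label boundary."""
    counts: Counter = Counter()
    for labels in annotations:
        for a, b in combinations(labels, 2):
            if a != b:
                # Store the boundary as an ordered pair so (a, b) and (b, a) coincide.
                counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


# Toy data: three items, three annotators each (illustrative only).
toy = [
    ["hate", "offensive", "offensive"],
    ["offensive", "neither", "neither"],
    ["hate", "hate", "hate"],
]
shares = disagreement_shares(toy)
```

A majority vote would label the first item "offensive" and discard the dissenting "hate" vote, which is exactly the information this tally keeps visible.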
This paper presents a framework for translating Marathi government documents to English that maintains layout fidelity and structural integrity, addressing limitations of existing systems that neglect formatting. The system integrates layout-aware OCR, coordinate-based text extraction, LLM translation, and HTML reconstruction to ensure spatial alignment and hierarchical consistency.
The open-source project Mathswitch imports mathematical concept records from sources like Wikidata and Wikipedia, linking records that refer to the same concept without reorganizing the original content. To address noise in the imported data, such as non-mathematical or ambiguous items, the authors test whether a voting ensemble of LLM judges can effectively filter this noise.
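A voting ensemble of this kind can be sketched as a strict majority vote over independent judges. In the sketch below each judge is a hypothetical callable that returns `True` when it considers a record mathematical; the stub judges stand in for LLM calls and are not part of Mathswitch, and the tie-breaking rule (ties reject) is an assumption.

```python
from typing import Callable

Judge = Callable[[str], bool]  # hypothetical interface: record text -> keep?


def is_mathematical(record: str, judges: list[Judge]) -> bool:
    """Accept a record only if a strict majority of judges vote to keep it."""
    votes = [judge(record) for judge in judges]
    return sum(votes) * 2 > len(votes)


def filter_records(records: list[str], judges: list[Judge]) -> list[str]:
    """Drop records that the ensemble does not accept."""
    return [r for r in records if is_mathematical(r, judges)]


# Stub judges standing in for LLM calls (assumptions for illustration only).
def keyword_judge(record: str) -> bool:
    return any(w in record.lower() for w in ("theorem", "group", "integral"))


def length_judge(record: str) -> bool:
    return len(record.split()) >= 2


def strict_judge(record: str) -> bool:
    return "theorem" in record.lower()


judges = [keyword_judge, length_judge, strict_judge]
kept = filter_records(["Pythagorean theorem", "Rock group", "Album"], judges)
```

Using an odd number of judges avoids ties; with an even number, the strict-majority rule here rejects a split vote.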
This paper investigates using large language models as teacher models in knowledge-distillation workflows to automatically label training data for smaller student models in entity matching tasks. The study evaluates various pair-selection strategies, teacher and student models, and post-processing methods across five standard benchmarks.
The AgentSeal v5 audit tool evaluated the public availability of artifacts in the SWE-bench Pro benchmark to assess potential contamination risks. The study found that while 12 instances showed deterministic content overlap and 76 repositories were probable corpus members, most evidence consisted of date-unknown public replication rather than proven pre-cutoff contamination.
Google UK has released its latest Economic Impact Report detailing strategies to help more people unlock the benefits of AI-powered technologies in the country.
Researchers introduce LAMP, a multi-agent framework that synthesizes kernel-verified Lean 4 proofs for Combinatorics on Words by providing structured domain knowledge via an ontology. This approach addresses the lack of specialized lemmas in existing provers trained primarily on Mathlib data.
A comprehensive empirical study reveals that fine-tuning large language models with benign multilingual data significantly increases their tendency to comply with unsafe adversarial prompts, a phenomenon termed multilingual safety drift. The research demonstrates that safety outcomes are highly sensitive to both the language used for fine-tuning and the language of evaluation, with compliance rates increasing four-fold in certain settings.
The article introduces wav2VOT, a tool for the automatic estimation of voice onset time, closure duration, and burst realisation that leverages the wav2vec2 model. It addresses the need for accurate speech annotation tools in phonetic research by demonstrating how large speech models can be applied to these specific tasks.
This paper audits the license provenance of over twenty corpus families used in African NLP, revealing that while Creative Commons licenses dominate releases, their compatibility rules are rarely applied. The authors construct a six-tier compatibility matrix and apply it to three case-study languages: Kituba/Munukutuba, Zarma, and Moore.
This study investigates memory-managed long-context attention by separating a fast recurrent or sparse backbone from explicit editable request-local memory slots and query-time sparse fallback. The research aims to address the limitations of existing linear, recurrent, and sparse attention methods in managing when facts should be written, overwritten, protected, or discarded.
This paper introduces PASTA, a framework designed to integrate detailed factual information from news articles into Large Language Models (LLMs) to address the challenge of knowledge updating. The approach combines data augmentation, question-answering generation, and a novel self-learning Direct Preference Optimization (DPO) process to enable knowledge overwriting and hallucination suppression.