Qwen3.6-27B with 3-Critic Harness Matches Frontier Quality
A user tested Qwen3.6-27B (8-bit) alongside GLM5.2 using a coding harness that employs three critics—code review, test review, and Playwright e2e—to validate output quality.
This paper introduces DriftGuard, a framework that combines multi-monitor drift detection with selective model updating to address evolving toxicity in automated moderation systems. The system tracks specific safety-relevant shifts, such as identity-harm and toxic-risk drift, rather than relying solely on global distributional changes.
The authors introduce 5ting, a system designed for SemEval-2026 Task 8 (MTRAGEval), which evaluates multi-turn Retrieval Augmented Generation (RAG) systems. The system addresses challenges such as context drift, underspecification, and hallucination risk by combining dense retrieval with LLM-based reranking and faithfulness control.
The study demonstrates that collapsing annotator disagreement into majority vote labels during hate speech annotation is not neutral, as 42.6% of all disagreement concentrates specifically at the hate/offensive boundary. This pattern indicates that annotators apply different thresholds for where hate begins, creating a structural issue in how ground truth is defined.
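To make the point about majority voting concrete, here is a minimal sketch (not the study's code) showing how two items with different levels of annotator agreement collapse to the same label. The function names and the toy votes are assumptions for illustration only; the labels follow the hate/offensive boundary mentioned in the summary.

```python
from collections import Counter


def majority_label(votes):
    """Collapse annotator votes to the single most common label."""
    return Counter(votes).most_common(1)[0][0]


def label_distribution(votes):
    """Keep the full vote distribution as relative frequencies."""
    counts = Counter(votes)
    total = len(votes)
    return {label: n / total for label, n in counts.items()}


# Hypothetical annotations for two items (illustrative, not from the paper).
unanimous = ["hate", "hate", "hate"]
split = ["hate", "hate", "offensive"]

# Both items receive the same majority label, so the disagreement on the
# second item is invisible once the votes are collapsed.
same_label = majority_label(unanimous) == majority_label(split)

# The distributions still differ, which is the information a majority vote discards.
unanimous_dist = label_distribution(unanimous)
split_dist = label_distribution(split)
```

Keeping the per-item distribution alongside the majority label is one way to preserve boundary disagreement for later analysis.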
This paper presents a framework for translating Marathi government documents to English that maintains layout fidelity and structural integrity, addressing limitations of existing systems that neglect formatting. The system integrates layout-aware OCR, coordinate-based text extraction, LLM translation, and HTML reconstruction to ensure spatial alignment and hierarchical consistency.
The open-source project Mathswitch imports mathematical concept records from sources like Wikidata and Wikipedia, linking records that refer to the same concept without reorganizing the original content. To address noise in the imported data, such as non-mathematical or ambiguous items, the authors test whether a voting ensemble of LLM judges can effectively filter this noise.
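The voting idea can be sketched as follows, assuming a strict-majority rule; the summary does not state the actual voting rule or prompts. The judges here are plain stand-in functions rather than LLM calls, and every name (`ensemble_keep`, `JUDGES`, the three judge functions) is hypothetical.

```python
from typing import Callable, Sequence

# A judge takes a record's text and returns True to keep it.
Judge = Callable[[str], bool]


def ensemble_keep(record: str, judges: Sequence[Judge]) -> bool:
    """Keep a record only if a strict majority of judges vote to keep it."""
    votes = [judge(record) for judge in judges]
    return sum(votes) * 2 > len(votes)


# Stand-in judges (assumptions): in a real setup each would wrap an LLM call.
def mentions_math_term(record: str) -> bool:
    return any(t in record.lower() for t in ("theorem", "lemma", "group"))


def is_not_disambiguation(record: str) -> bool:
    return "disambiguation" not in record.lower()


def has_enough_text(record: str) -> bool:
    return len(record.split()) >= 3


JUDGES = [mentions_math_term, is_not_disambiguation, has_enough_text]

records = [
    "Pythagorean theorem relates triangle sides",
    "Group (disambiguation)",
]
kept = [r for r in records if ensemble_keep(r, JUDGES)]
```

With an odd number of judges a strict majority never ties; with an even number, a tie drops the record under this rule.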
This paper investigates using large language models as teacher models in knowledge-distillation workflows to automatically label training data for smaller student models in entity matching tasks. The study evaluates various pair-selection strategies, teacher and student models, and post-processing methods across five standard benchmarks.
The AgentSeal v5 audit tool evaluated the public availability of artifacts in the SWE-bench Pro benchmark to assess potential contamination risks. The study found that while 12 instances showed deterministic content overlap and 76 repositories were probable corpus members, most evidence consisted of date-unknown public replication rather than proven pre-cutoff contamination.
Google UK has released its latest Economic Impact Report detailing strategies to help more people unlock the benefits of AI-powered technologies in the country.
Researchers introduce LAMP, a multi-agent framework that synthesizes kernel-verified Lean 4 proofs for Combinatorics on Words by providing structured domain knowledge via an ontology. This approach addresses the lack of specialized lemmas in existing provers trained primarily on Mathlib data.
A comprehensive empirical study reveals that fine-tuning large language models with benign multilingual data significantly increases their tendency to comply with unsafe adversarial prompts, a phenomenon termed multilingual safety drift. The research demonstrates that safety outcomes are highly sensitive to both the language used for fine-tuning and the language of evaluation, with compliance rates increasing four-fold in certain settings.
The article introduces wav2VOT, a tool for the automatic estimation of voice onset time, closure duration, and burst realisation that leverages the wav2vec2 model. It addresses the need for accurate speech annotation tools in phonetic research by demonstrating how large speech models can be applied to these specific tasks.
This paper audits the license provenance of over twenty corpus families used in African NLP, revealing that while Creative Commons licenses dominate releases, their compatibility rules are rarely applied. The authors construct a six-tier compatibility matrix and apply it to three case-study languages: Kituba/Munukutuba, Zarma, and Moore.
This study investigates memory-managed long-context attention by separating a fast recurrent or sparse backbone from explicit editable request-local memory slots and query-time sparse fallback. The research aims to address the limitations of existing linear, recurrent, and sparse attention methods in managing when facts should be written, overwritten, protected, or discarded.
This paper introduces PASTA, a framework designed to integrate detailed factual information from news articles into Large Language Models (LLMs) to address the challenge of knowledge updating. The approach combines data augmentation, question-answering generation, and a novel self-learning Direct Preference Optimization (DPO) process to enable knowledge overwriting and hallucination suppression.
The authors introduce MedEvoEval, an executable longitudinal evaluation framework designed to assess the continual evolution of doctor agents through simulated outpatient clinical episodes. This system moves beyond static benchmarks by tracking how agents acquire evidence, utilize resources, and refine their decision-making across multiple interactions.
The authors introduce GRAB, a constructor-encoder-bridge pipeline designed for table question answering that lifts relational data into a heterogeneous graph and encodes it via message passing. The method transfers signals to a frozen large language model through a small set of query-conditioned latent tokens, providing a compact structural representation while preserving the LLM's general reasoning capabilities.
Researchers introduce FinInvest-GTCN, a Graph-Temporal-Causal Network designed to optimize venture capital investment decisions by addressing challenges like heterogeneous data and non-stationary time series. The model redefines the task from content recommendation to quantitative risk-return assessment, utilizing a relational graph encoder, multi-scale temporal fusion, and a causal decision head to generate interpretable predictions.
The authors introduce the Electro-Visual-Language Assistant (EVLA), a framework that integrates multi-modal scene understanding with real-time perception of an electrified powertrain's electro-mechanical state to improve driving decisions. This approach addresses the limitation of existing vision-language models that treat vehicle dynamics as a black box by incorporating physical constraints and optimization objectives.
The A3M framework addresses the challenges of learning to bid in repeated multi-unit auctions by integrating adaptive deep reinforcement learning, adversarial reasoning, and multi-objective reward design. It utilizes an actor-critic backbone and opponent modeling to optimize strategy against non-stationary adversaries while balancing utility, revenue, and fairness.