All articles
arXiv cs.CL · 5h ago

Clinical Evidence Strength Is Recoverable From LLM Representations, Not Stated Grades

A study of 22 open-weight large language models reveals that while the strength of clinical evidence can be recovered from model activations and text, the grades explicitly stated by the models are no better than chance. Researchers analyzed 45,134 clinical claims harmonized into four-level evidence grades to test whether models register and express evidence strength distinct from factual truth.

arXiv cs.CL · 6h ago

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

Researchers introduce Evolution Fine-Tuning (EFT), a mid-training paradigm that teaches large language models to evolve solutions across diverse tasks by converting evolutionary search trajectories into supervision. This approach addresses a limitation of prior methods, which discard accumulated experience, and lets models reuse discovery capabilities rather than solve each new problem from scratch.

arXiv cs.CL · 7h ago

Pre-Registered Screening Rule for Evolutionary Outer Loops

The authors introduce a pre-registered screening rule that determines before implementation whether an evolutionary outer loop over neural network parameters is worth building compared to a cheap single-shot alternative. The rule calculates a recovery metric R, defined as the best single-shot gain divided by the best gain of any cheap method, and prescribes skipping the outer loop when R is greater than or equal to 90%.
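The rule as summarized here can be sketched in a few lines of Python. This is an illustrative reading of the summary, not the authors' code: the function names, the assumption that gains are measured against a shared baseline with larger values being better, and the handling of a non-positive denominator are all hypothetical.

```python
from typing import Iterable


def recovery_metric(single_shot_gains: Iterable[float],
                    cheap_method_gains: Iterable[float]) -> float:
    """R = best single-shot gain / best gain of any cheap method.

    Assumption: all gains are improvements over the same baseline,
    with larger values being better.
    """
    best_single_shot = max(single_shot_gains)
    best_cheap = max(cheap_method_gains)
    if best_cheap <= 0:
        # Assumption: R is undefined when no cheap method improves on the baseline.
        raise ValueError("best cheap-method gain must be positive")
    return best_single_shot / best_cheap


def should_skip_outer_loop(r: float, threshold: float = 0.90) -> bool:
    """Skip building the evolutionary outer loop when R >= 90%."""
    return r >= threshold


# Hypothetical gains in percentage points over a baseline.
r = recovery_metric(single_shot_gains=[3.0, 4.5],
                    cheap_method_gains=[4.5, 5.0])
print(f"R = {r:.2f}, skip outer loop: {should_skip_outer_loop(r)}")
```

With these made-up numbers the best single-shot gain recovers 90% of the best cheap-method gain, so the rule would prescribe skipping the outer loop.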

arXiv cs.CL · 7h ago

Evidence-Informed LLM Beliefs for Continual Scientific Discovery

The authors address a limitation of AutoDiscovery's static "Bayesian surprise" by introducing evidence-informed LLM beliefs: priors are updated with evidence from previous hypotheses, so the computed surprisal is non-stationary. They find that embedding-based retrieval-augmented generation over prior discoveries best anticipates eventual posteriors, and they identify 37.5% of static surprisals as spurious.

arXiv cs.CL · 8h ago

PolicyGuard: A Dialogue-Grounded Sub-Agent Verifier for Policy Adherence in LLM Agents

Researchers introduce PolicyGuard, a sub-agent verifier designed to improve policy adherence in LLM agents by reasoning over the full dialogue context rather than relying on external checks of individual arguments. This approach addresses the limitations of prior safeguarding methods that often underestimate the need for conversation-specific remediation and explicit user confirmation.