arXiv cs.CL · 4h ago

Clinical Evidence Strength Is Recoverable From LLM Representations, Not Stated Grades

A study of 22 open-weight large language models reveals that while the strength of clinical evidence can be recovered from model activations and text, the grades explicitly stated by the models are no better than chance. Researchers analyzed 45,134 clinical claims harmonized into four-level evidence grades to test whether models register and express evidence strength distinct from factual truth.

arXiv cs.CL · 5h ago

Evolution Fine-Tuning: Learning to Discover Across 371 Optimization Tasks

Researchers introduce Evolution Fine-Tuning (EFT), a mid-training paradigm that teaches large language models to evolve solutions across diverse tasks by converting evolutionary search trajectories into supervision. The approach addresses a limitation of prior methods, which discard accumulated experience, and lets models reuse discovery capabilities rather than solve each new problem from scratch.

arXiv cs.CL · 6h ago

Pre-Registered Screening Rule for Evolutionary Outer Loops

The authors introduce a pre-registered screening rule that decides, before anything is implemented, whether an evolutionary outer loop over neural network parameters is worth building instead of a cheap single-shot alternative. The rule computes a recovery metric R, defined as the best single-shot gain divided by the best gain of any cheap method, and prescribes skipping the outer loop when R is at least 90%.
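
As a rough illustration of the rule as summarized above, the following Python sketch computes R and applies the 90% threshold. The function names, the treatment of gains as plain floats, and the error raised on a non-positive denominator are assumptions made for this example, not details taken from the paper.

```python
def recovery(best_single_shot_gain: float, best_cheap_gain: float) -> float:
    """Return R = best single-shot gain / best gain of any cheap method.

    Assumption: both gains are measured on the same scale, and the best
    cheap gain is strictly positive so the ratio is well defined.
    """
    if best_cheap_gain <= 0:
        # Hypothetical handling; the summary does not say what to do here.
        raise ValueError("best_cheap_gain must be positive")
    return best_single_shot_gain / best_cheap_gain


def should_skip_outer_loop(r: float, threshold: float = 0.90) -> bool:
    """Apply the screening rule: skip the outer loop when R >= threshold."""
    return r >= threshold
```

For example, with a best single-shot gain of 9.0 and a best cheap-method gain of 10.0, `recovery(9.0, 10.0)` gives R = 0.9, so `should_skip_outer_loop` returns `True` and the outer loop would be skipped. With a single-shot gain of 5.0 against the same denominator, R = 0.5 and the rule does not prescribe skipping.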