Evaluation & benchmarks — korshunov.ai


Quality-Aware Training Data Selection for Scientific Summarization

We construct and release a large biomedical dataset with 1.88 million PMC articles. Analysis shows author-written abstracts vary in quality and alignment with source articles, enabling effective training-data selection. Training on high-quality subsets outperforms random sampling and matches larger random subsets on factuality metrics.

arXiv cs.CL · 1d ago

Linguistic Fingerprints Reveal Tang Poets' Regional Origins

A computational analysis of the Complete Tang Poems shows that poets' geographic origins leave detectable linguistic traces. Models using character n-gram TF-IDF and domain features achieve 0.69 accuracy in predicting broad regional origin (South vs. North), surpassing chance, and correctly classify finer circuit-level origins. The study finds linguistic distance between circuits correlates with geographic distance, with regional divergence increasing in the Late Tang, and highlights historical biases in early Tang poetic style.

arXiv cs.CL · 1d ago
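
A minimal sketch of the character n-gram TF-IDF features mentioned in the summary above, in pure Python. The toy strings, the labels, the 1-to-2 character n-gram range, and the nearest-centroid classifier are all assumptions for illustration; the summary does not say which classifier or n-gram range the study uses.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=1, n_max=2):
    """Return all character n-grams of length n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram seen in docs."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc)))
    n_docs = len(docs)
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}

def tfidf(doc, idf):
    """Sparse TF-IDF vector (dict); n-grams unseen at fit time are dropped."""
    tf = Counter(g for g in char_ngrams(doc) if g in idf)
    return {g: c * idf[g] for g, c in tf.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(g, 0.0) for g, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fit_centroids(docs, labels, idf):
    """Average the TF-IDF vectors of each class."""
    counts = Counter(labels)
    sums = {}
    for doc, label in zip(docs, labels):
        acc = sums.setdefault(label, Counter())
        for g, v in tfidf(doc, idf).items():
            acc[g] += v
    return {lab: {g: v / counts[lab] for g, v in acc.items()}
            for lab, acc in sums.items()}

def predict(doc, centroids, idf):
    """Assign the class whose centroid is most similar to the document."""
    vec = tfidf(doc, idf)
    return max(centroids, key=lambda lab: cosine(vec, centroids[lab]))

# Toy placeholder data (hypothetical, not from the Complete Tang Poems).
train_docs = ["aaab", "aaba", "ccdc", "cdcc"]
train_labels = ["north", "north", "south", "south"]
idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
```

Character n-grams need no word segmentation, which is one plausible reason to choose them for Classical Chinese text, although the summary does not state the authors' motivation.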

First Large-Scale Analysis of Algorithm Co-Occurrence Networks

This study analyzes algorithm influence through co-occurrence networks in natural language processing, using full-text academic papers. It reveals that algorithm networks exhibit complex network features, with denser connections emerging over two decades, and that classic algorithms at research intersections show high centrality and balanced influence. The research provides a temporal and structural view of algorithm evolution and lays groundwork for future studies on algorithm, scholar, and task networks.

arXiv cs.CL · 1d ago
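
To make the idea of an algorithm co-occurrence network concrete, here is a small sketch that links two algorithms whenever they are mentioned in the same paper and then scores each node by degree centrality. The paper contents are made-up placeholders, and degree centrality is an assumed choice; the summary does not name the centrality measure used in the study.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(papers):
    """Count how many papers mention each unordered pair of algorithms."""
    edges = Counter()
    for mentions in papers:
        for a, b in combinations(sorted(set(mentions)), 2):
            edges[(a, b)] += 1
    return edges

def degree_centrality(edges):
    """Fraction of the other nodes that each node is directly linked to."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbours.items()}

# Hypothetical per-paper algorithm mentions.
papers = [
    {"SVM", "CRF", "LSTM"},
    {"SVM", "LSTM"},
    {"CRF", "HMM"},
]
edges = cooccurrence_edges(papers)
centrality = degree_centrality(edges)
```

In this toy graph, CRF is linked to every other node and so has the highest degree centrality, which mirrors the summary's point about algorithms that sit at research intersections.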

PORTER: Language-Grounded Event Representations for Portable EHR Foundation Models

PORTER introduces a language-grounded structured EHR foundation model that represents clinical events via descriptions instead of fixed vocabularies. It achieves superior performance across 74 pediatric prediction tasks and transfers effectively to new vocabularies without retraining, recovering 97.1% of target AUROC and outperforming fixed-vocabulary models on MIMIC, with 329-fold lower compute than text serialization approaches.

arXiv cs.CL · 1d ago

LoRA Monitor Calibration Fails with Top-1 in Diffusion LMs

Top-1 argmax concentration fails as a collapse warning in LoRA-optimized diffusion language models, showing zero precision across 816 configurations. Max LoRA gradient norm outperforms this baseline, achieving 0.68 precision and 0.79 F1 on a held-out LLaDA split, though results are limited to short-horizon, family-specific inspections.

arXiv cs.CL · 1d ago
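
For readers who want to relate the precision and F1 figures above, the sketch below computes both from confusion counts and inverts the F1 formula to get the recall implied by a precision/F1 pair. The counts are hypothetical, and the implied recall is a back-of-the-envelope value derived from rounded numbers, not a figure reported in the summary.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def implied_recall(precision, f1):
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall for this precision/F1 pair")
    return f1 * precision / denom

# A monitor whose alerts never coincide with a real collapse has zero precision
# (hypothetical counts).
zero_case = precision_recall_f1(tp=0, fp=5, fn=3)

# Recall implied by the rounded figures quoted above (0.68 precision, 0.79 F1).
recall_estimate = implied_recall(0.68, 0.79)
```

With the rounded figures, the estimate comes out at roughly 0.94, so the gradient-norm monitor would be catching most collapses at the cost of some false alarms. Rounding in the inputs makes that value approximate.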

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

HDS introduces a multi-objective reinforcement learning framework for online data mixing in LLM pre-training. It achieves 44% fewer training iterations on The Pile benchmark and improves MMLU 0-shot performance by 7.2%, with consistent gains across other benchmarks.

arXiv cs.CL · 1d ago

InterAligner: Progressive Alignment for ASR

InterAligner introduces an intermediate aligner objective and InterCTC loss to enable progressive alignment formation in deep ASR models. On LibriSpeech with a 17-layer Conformer, it reduces WER from 5.0/7.8 to 3.1/5.6, with significant improvements on long utterances.

arXiv cs.CL · 1d ago
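
As a reminder of what the WER figures above measure, the following sketch computes word error rate as word-level edit distance divided by reference length, plus the relative reduction between two WER values. The example sentences are invented, and whitespace tokenization is an assumption; real ASR scoring usually applies text normalization first.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(before, after):
    """Relative improvement, e.g. 0.38 means a 38% relative reduction."""
    return (before - after) / before

# Invented example: one substitution ("sat" -> "sit") and one deletion ("the").
example_wer = wer("the cat sat on the mat", "the cat sit on mat")

# Relative reductions for the two WER pairs quoted in the summary.
first_pair = relative_reduction(5.0, 3.1)
second_pair = relative_reduction(7.8, 5.6)
```

On the quoted numbers this works out to relative reductions of about 38% and 28%.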

Metis: Bridging Text and Code Memory for Self-Evolving Agents

Metis introduces a hierarchical dual-representation memory that combines text and code memory to improve self-evolving agents. It organizes experience into execution plans, facts, and pitfalls, crystallizing reusable plans into validated tools only when justified. Evaluated on AppWorld, Metis achieves up to 20.6% higher task accuracy and 22.8% lower execution cost than ReAct, with better overall balance across accuracy, efficiency, and memory cost.

arXiv cs.CL · 1d ago

MedBench v5: Dynamic Benchmark for Clinical AI

MedBench v5 introduces a dynamic, process-oriented benchmark for clinical multimodal models, featuring clinical cognitive responsiveness and atomic skills across 63 tasks. It includes stressors for degradation analysis and monitors hallucination propagation through five reasoning nodes, revealing that strong task performance does not ensure process stability.

arXiv cs.CL · 1d ago

BehaviorBench Launches Benchmark for Behavioral AI Models

BehaviorBench introduces a comprehensive benchmark to evaluate foundation models across four behavioral science capabilities: behavior prediction, strategic decision-making, subject-trait inference, and knowledge application. It assesses models at both individual and distributional levels, revealing that behavioral foundation models like Be.FM-1.5 achieve stronger distributional alignment than general-purpose models, highlighting the need for distributional evaluation in behavioral AI.

arXiv cs.CL · 1d ago

CORE-BREW: LLR-Based Soft Decoding for Robust Multi-Bit LLM Watermarking

CORE-BREW introduces a soft-decision decoding method using calibrated log-likelihood ratios to enable robust multi-bit watermarking in LLMs. It achieves consistent hit rates and improved false-positive control through strict and FPR-calibrated detection modes, outperforming prior baselines under token-level edits and paraphrasing while preserving semantic quality.

arXiv cs.CL · 1d ago

Pāṇinian Foundation for Indic Language Processing

A new benchmark suite proposes leveraging Pāṇini's ancient grammar as a unifying framework for Indic language processing. This approach aims to improve accuracy, data efficiency, and transferability by grounding NLP tools in a shared morphosyntactic architecture. The framework raises questions about whether neural models internally represent Pāṇini's linguistic categories.

arXiv cs.CL · 1d ago

Digi Turbine: A Reliability-Aware PINN Benchmark for Offshore Wind Monitoring

Digi Turbine is a synthetic benchmark whose training objective combines a simplified beam model with a Winkler soil foundation. It uses Bayesian inverse identification and First Order Reliability Method screening to enable reliable state estimation from sparse sensor data. Validation is based on synthetic configurations derived from the NREL 5MW turbine.

arXiv cs.CL · 1d ago

Aspect-Based Sentiment Evolution in Multi-Round Peer Reviews

A deep learning study analyzes how sentiment evolves across review rounds for 11,063 Nature Communications papers. As the rounds progress, positive sentiment rises and negative sentiment declines. Aspect-level sentiment is negatively correlated with the total number of rounds, particularly for 'experiments', 'research significance', and 'result analysis'.

arXiv cs.CL · 1d ago

ReCARE: Robust erasure for co-occurring retained concepts in diffusion unlearning

ReCARE introduces a framework that preserves benign co-occurring concepts during unlearning by defining CARE (Co-occurring Associated REtained concepts) and using a CARE score to quantify their retention. It automatically constructs a CARE-set from target images and integrates it into training to ensure stable unlearning while erasing only the target concept.

arXiv cs.CL · 1d ago

Dialogue to Discovery: Attribute-Aware Preference Elicitation

Dialogue to Discovery (D2D) is an attribute-oriented framework that improves conversational product search by dynamically guiding user interactions. It adapts query priorities and recommendation timing, achieving 22.2-29.9% higher target-finding accuracy, 6.6-16.1% lower abandonment, and 27.5% shorter conversations compared to existing methods, with user studies confirming improved satisfaction and efficiency.

arXiv cs.CL · 1d ago

Decoherence as Defence in Quantum Neural Networks for Intrusion Detection

A rigorous N-qubit theory proves that depolarising noise in stochastic quantum neural networks contracts Pauli read-outs exponentially, enabling robust anomaly detection. On the NSL-KDD dataset, such noise achieves significant adversarial resilience without catastrophic collapse, outperforming noiseless models and classical detectors under FGSM and PGD attacks, with reduced robustness variance and a train-test gap reduction of approximately 0.01.

arXiv cs.CL · 1d ago

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

CALIBER introduces a method that elicits and supervises confidence estimates at two stages: before and after reasoning. It reduces Expected Calibration Error by 52.5% on BigMathDigits for a 7B model, achieving the best Brier score and AUROC, and performs best on out-of-distribution benchmarks like GPQA and TriviaQA.

arXiv cs.CL · 1d ago
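
The two calibration metrics named above can be sketched in a few lines. This version uses equal-width confidence bins for Expected Calibration Error and the standard mean-squared form of the Brier score. The bin count of 10 and the toy predictions are assumptions; the summary does not describe the paper's binning scheme.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece

def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - h) ** 2 for c, h in zip(confidences, correct)) / len(confidences)

# Toy predictions (hypothetical): stated confidence and whether the answer was right.
confidences = [0.95, 0.95, 0.65, 0.65]
correct = [1, 0, 1, 1]
ece = expected_calibration_error(confidences, correct)
brier = brier_score(confidences, correct)
```

In the toy data the model is overconfident in the 0.95 bin (half right) and underconfident in the 0.65 bin (all right), and both gaps count toward ECE.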

SURGELLM: Task-Aware Feature Gating with Class-Balanced Normalization

SURGELLM introduces a unified transformer framework with surgical feature gating, task-conditioned prefix tokens, and Instance-Weighted Normalization to address inductive bias mismatches, class imbalance, and lack of lexical knowledge integration. The IWN variant achieves macro-F1 of 0.940 across four tasks, outperforming baselines by 0.036 overall and 0.130 on authorship detection, with gains confirmed as lexical rather than parametric.

arXiv cs.CL · 1d ago

Bad Prompts Cause Model Collapse and Mistakes

Bad contexts in conversations can lead to 'pigeonholing', where models repeat incorrect answers or narrow down to a single response. Experiments show performance drops of 38-40% and worsening errors with more conversation turns, even when initial inputs are correct. A new method, RLVR with synthetic errors, improves model performance by 43-60% under such bad contexts.