Evaluation & benchmarks
arXiv cs.CL · 1d ago

Linguistic Fingerprints Reveal Tang Poets' Regional Origins

A computational analysis of the Complete Tang Poems shows that poets' geographic origins leave detectable linguistic traces. Models combining character n-gram TF-IDF with domain features reach 0.69 accuracy in predicting broad regional origin (South vs. North), above chance, and also classify finer circuit-level origins. The study finds that linguistic distance between circuits correlates with geographic distance and that regional divergence increases in the Late Tang. It also highlights historical biases in early Tang poetic style.
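To make the "character n-gram TF-IDF" idea concrete, here is a minimal sketch, not the study's pipeline. It assumes unigrams plus bigrams, a smoothed IDF, and a 1-nearest-neighbour decision by cosine similarity; the classifier, the domain features, and the data used in the paper are not given in this summary, and the strings below are toy placeholders rather than lines from the corpus.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=1, n_max=2):
    """Return all character n-grams of length n_min..n_max."""
    return [text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram seen in docs."""
    df = Counter()
    for doc in docs:
        df.update(set(char_ngrams(doc)))
    n_docs = len(docs)
    return {g: math.log((1 + n_docs) / (1 + c)) + 1 for g, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector; n-grams unseen in training are dropped."""
    tf = Counter(g for g in char_ngrams(doc) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}

def cosine(a, b):
    """Dot product of two L2-normalised sparse vectors."""
    return sum(v * b.get(g, 0.0) for g, v in a.items())

def predict(doc, train, idf):
    """Label of the most similar training document (1-nearest neighbour)."""
    q = tfidf(doc, idf)
    return max(train, key=lambda pair: cosine(q, tfidf(pair[0], idf)))[1]

# Toy placeholder strings standing in for poems (assumption, not corpus data).
train = [("aabbaab", "North"), ("abbabba", "North"),
         ("ccddccd", "South"), ("cddcddc", "South")]
idf = fit_idf([text for text, _ in train])
print(predict("ddccdd", train, idf))
```

Character n-grams suit Classical Chinese because they need no word segmentation; a real experiment would swap the nearest-neighbour step for a trained classifier and evaluate on held-out poets.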

arXiv cs.CL · 1d ago

First Large-Scale Analysis of Algorithm Co-Occurrence Networks

This study analyzes algorithm influence through co-occurrence networks in natural language processing, using full-text academic papers. It reveals that algorithm networks exhibit complex network features, with denser connections emerging over two decades, and that classic algorithms at research intersections show high centrality and balanced influence. The research provides a temporal and structural view of algorithm evolution and lays groundwork for future studies on algorithm, scholar, and task networks.
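As a rough illustration of what a co-occurrence network and its centrality look like, the sketch below builds edges from per-paper mention lists and computes degree centrality and density. The mention lists are hypothetical, and degree centrality is only one possible measure; the summary does not say which centrality measures or extraction method the study uses.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(papers):
    """Count, for each unordered pair of algorithms, the papers mentioning both."""
    edges = Counter()
    for mentions in papers:
        for a, b in combinations(sorted(set(mentions)), 2):
            edges[(a, b)] += 1
    return edges

def degree_centrality(edges):
    """Fraction of the other nodes that each node is linked to."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbours.items()}

def density(edges):
    """Share of possible node pairs that are actually connected."""
    n = len({node for pair in edges for node in pair})
    return 2 * len(edges) / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical mention lists, one per paper (illustrative only).
papers = [["SVM", "CRF"], ["CRF", "LSTM"],
          ["LSTM", "SVM", "BERT"], ["BERT", "LSTM"]]
edges = cooccurrence_edges(papers)
centrality = degree_centrality(edges)
print(centrality, density(edges))
```

Tracking `density` over yearly slices of the paper set is one simple way to observe the kind of densification over time that the summary describes.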

arXiv cs.CL · 1d ago

PORTER: Language-Grounded Event Representations for Portable EHR Foundation Models

PORTER introduces a language-grounded structured EHR foundation model that represents clinical events via descriptions instead of fixed vocabularies. It achieves superior performance across 74 pediatric prediction tasks and transfers effectively to new vocabularies without retraining, recovering 97.1% of target AUROC and outperforming fixed-vocabulary models on MIMIC, with 329-fold lower compute than text serialization approaches.

arXiv cs.CL · 1d ago

Metis: Bridging Text and Code Memory for Self-Evolving Agents

Metis introduces a hierarchical dual-representation memory that combines text and code memory to improve self-evolving agents. It organizes experience into execution plans, facts, and pitfalls, crystallizing reusable plans into validated tools only when justified. Evaluated on AppWorld, Metis achieves up to 20.6% higher task accuracy and 22.8% lower execution cost than ReAct, with better overall balance across accuracy, efficiency, and memory cost.

arXiv cs.CL · 1d ago

BehaviorBench Launches Benchmark for Behavioral AI Models

BehaviorBench introduces a comprehensive benchmark to evaluate foundation models across four behavioral science capabilities: behavior prediction, strategic decision-making, subject-trait inference, and knowledge application. It assesses models at both individual and distributional levels, revealing that behavioral foundation models like Be.FM-1.5 achieve stronger distributional alignment than general-purpose models, highlighting the need for distributional evaluation in behavioral AI.

arXiv cs.CL · 1d ago

Dialogue to Discovery: Attribute-Aware Preference Elicitation

Dialogue to Discovery (D2D) is an attribute-oriented framework that improves conversational product search by dynamically guiding user interactions. It adapts query priorities and recommendation timing, achieving 22.2-29.9% higher target-finding accuracy, 6.6-16.1% lower abandonment, and 27.5% shorter conversations compared to existing methods, with user studies confirming improved satisfaction and efficiency.

arXiv cs.CL · 1d ago

Decoherence as Defence in Quantum Neural Networks for Intrusion Detection

A rigorous N-qubit theory proves that depolarising noise in stochastic quantum neural networks contracts Pauli read-outs exponentially, enabling robust anomaly detection. On the NSL-KDD dataset, models trained with such noise show significant adversarial resilience without catastrophic collapse, outperforming noiseless models and classical detectors under FGSM and PGD attacks, with reduced robustness variance and a train-test gap reduction of approximately 0.01.
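The exponential contraction can be seen in the simplest case with a short numerical sketch. It assumes the standard single-qubit depolarising channel, rho -> (1 - p) * rho + p * I / 2, applied once per layer; the paper's N-qubit statement and its noise parametrisation are not reproduced here, and the values of `p` and `layers` are arbitrary.

```python
def depolarise(rho, p):
    """Single-qubit depolarising channel: rho -> (1 - p) * rho + p * I / 2."""
    return [[(1 - p) * rho[i][j] + (p / 2 if i == j else 0.0)
             for j in range(2)] for i in range(2)]

def expect_z(rho):
    """Pauli-Z read-out Tr(Z rho) for Z = diag(1, -1)."""
    return rho[0][0] - rho[1][1]

p = 0.1       # assumed noise strength (illustrative)
layers = 3    # assumed number of noisy layers (illustrative)

rho = [[1.0, 0.0], [0.0, 0.0]]  # |0><0|, which has <Z> = 1
for _ in range(layers):
    rho = depolarise(rho, p)

# Each application scales <Z> by (1 - p), so the read-out decays as (1 - p) ** layers.
print(expect_z(rho), (1 - p) ** layers)
```

Because every read-out is pulled toward zero by the same factor, a small adversarial perturbation of the input can shift the output by less than it would in a noiseless circuit, which is the intuition behind treating decoherence as a defence.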

arXiv cs.CL · 1d ago

SURGELLM: Task-Aware Feature Gating with Class-Balanced Normalization

SURGELLM introduces a unified transformer framework with surgical feature gating, task-conditioned prefix tokens, and Instance-Weighted Normalization (IWN) to address inductive bias mismatches, class imbalance, and the lack of lexical knowledge integration. The IWN variant achieves a macro-F1 of 0.940 across four tasks, outperforming baselines by 0.036 overall and by 0.130 on authorship detection, with the gains confirmed as lexical rather than parametric.