Evaluation & benchmarks
arXiv cs.CL · 1d ago

Linguistic Fingerprints Reveal Tang Poets' Regional Origins

A computational analysis of the Complete Tang Poems shows that poets' geographic origins leave detectable linguistic traces. Models combining character n-gram TF-IDF with domain features reach 0.69 accuracy when predicting broad regional origin (South vs. North), which is above chance, and they also correctly classify finer circuit-level origins. The study finds that linguistic distance between circuits correlates with geographic distance and that regional divergence increases in the Late Tang. It also highlights historical biases in early Tang poetic style.
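For readers unfamiliar with the feature type, here is a minimal sketch of character n-gram TF-IDF in plain Python. The summary does not state the n-gram range, the IDF variant, the domain features, or the classifier, so the 1-2 character range, the smoothed IDF, and the toy strings below are all assumptions for illustration, not the study's setup.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=2):
    """Return character n-grams of length n_min..n_max, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(chars) - n + 1):
            grams.append("".join(chars[i:i + n]))
    return grams


def tfidf_vectors(docs, n_min=1, n_max=2):
    """Return one L2-normalised TF-IDF vector (a dict) per document.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is an assumed
    choice; the summary does not say which weighting the study used.
    """
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each n-gram counted once per document
    n_docs = len(docs)
    vectors = []
    for c in counts:
        vec = {
            g: tf * (math.log((1 + n_docs) / (1 + doc_freq[g])) + 1.0)
            for g, tf in c.items()
        }
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({g: v / norm for g, v in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(v * b.get(g, 0.0) for g, v in a.items())


# Toy placeholder strings, not real poems.
docs = ["abab", "abba", "cdcd"]
vecs = tfidf_vectors(docs)
```

With these toy strings, `cosine(vecs[0], vecs[1])` is positive because the first two strings share n-grams, while `cosine(vecs[0], vecs[2])` is zero. Vectors of this kind could then be passed to any standard classifier.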

arXiv cs.CL · 1d ago

First Large-Scale Analysis of Algorithm Co-Occurrence Networks

This study analyzes algorithm influence through co-occurrence networks in natural language processing, using full-text academic papers. It reveals that algorithm networks exhibit complex network features, with denser connections emerging over two decades, and that classic algorithms at research intersections show high centrality and balanced influence. The research provides a temporal and structural view of algorithm evolution and lays groundwork for future studies on algorithm, scholar, and task networks.
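As a rough illustration of the idea, the sketch below builds a co-occurrence network from per-paper lists of algorithm mentions and computes degree centrality and network density. The summary does not say how mentions were extracted or which centrality measures were used, so the function names, the choice of degree centrality, and the toy input are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(papers):
    """Map each unordered algorithm pair to the number of papers mentioning both."""
    weights = defaultdict(int)
    for mentions in papers:
        # Deduplicate within a paper so each pair counts once per paper.
        for a, b in combinations(sorted(set(mentions)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def degree_centrality(weights):
    """Fraction of the other nodes each node is directly connected to."""
    neighbours = defaultdict(set)
    for a, b in weights:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(nb) / (n - 1) for node, nb in neighbours.items()}


def density(weights):
    """Share of possible undirected edges that are present."""
    nodes = {x for pair in weights for x in pair}
    n = len(nodes)
    return 0.0 if n < 2 else len(weights) / (n * (n - 1) / 2)


# Toy input: each inner list is the set of algorithms one paper mentions.
papers = [["SVM", "CRF", "LSTM"], ["CRF", "LSTM"], ["SVM", "BERT"]]
edges = build_cooccurrence(papers)
centrality = degree_centrality(edges)
```

Computing `density` separately for each publication year would give a simple way to track whether connections become denser over time. Note that an algorithm mentioned only on its own never enters this graph, since edges are the only source of nodes here.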

arXiv cs.CL · 1d ago

PORTER: Language-Grounded Event Representations for Portable EHR Foundation Models

PORTER introduces a language-grounded structured EHR foundation model that represents clinical events via descriptions instead of fixed vocabularies. It achieves superior performance across 74 pediatric prediction tasks and transfers effectively to new vocabularies without retraining, recovering 97.1% of target AUROC and outperforming fixed-vocabulary models on MIMIC, with 329-fold lower compute than text serialization approaches.

arXiv cs.CL · 1d ago

Metis: Bridging Text and Code Memory for Self-Evolving Agents

Metis introduces a hierarchical dual-representation memory that combines text and code memory to improve self-evolving agents. It organizes experience into execution plans, facts, and pitfalls, crystallizing reusable plans into validated tools only when justified. Evaluated on AppWorld, Metis achieves up to 20.6% higher task accuracy and 22.8% lower execution cost than ReAct, with better overall balance across accuracy, efficiency, and memory cost.

arXiv cs.CL · 1d ago

BehaviorBench: A Benchmark for Behavioral AI Models

BehaviorBench is a comprehensive benchmark that evaluates foundation models on four behavioral science capabilities: behavior prediction, strategic decision-making, subject-trait inference, and knowledge application. It assesses models at both the individual and the distributional level. The results show that behavioral foundation models such as Be.FM-1.5 achieve stronger distributional alignment than general-purpose models, which highlights the need for distributional evaluation in behavioral AI.
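To make the individual versus distributional distinction concrete, the sketch below scores the same predictions two ways: per-subject accuracy and total variation distance between the predicted and observed choice distributions. The summary does not name the benchmark's metrics, so total variation distance and the toy data are assumptions chosen only to show how the two views can disagree.

```python
from collections import Counter


def individual_accuracy(predicted, observed):
    """Share of subjects whose individual choice was predicted exactly."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def total_variation(predicted, observed):
    """Total variation distance between two empirical choice distributions.

    0.0 means identical distributions, 1.0 means no overlap.
    """
    p, q = Counter(predicted), Counter(observed)
    n_p, n_q = len(predicted), len(observed)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


# Toy data: four subjects choosing between options "A" and "B".
observed = ["A", "B", "A", "B"]
swapped = ["B", "A", "B", "A"]   # wrong for every subject, right in aggregate
all_a = ["A", "A", "A", "A"]     # right for half the subjects, skewed in aggregate
```

Here `swapped` has an individual accuracy of 0.0 but a total variation distance of 0.0, while `all_a` has an accuracy of 0.5 and a distance of 0.5. A model can therefore look better on one level and worse on the other, which is why reporting both is informative.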