Evaluation & benchmarks — korshunov.ai


ParaPairAudioBench: Benchmark for Paralinguistic Speech Evaluation

ParaPairAudioBench introduces a pairwise benchmark of 5,175 audio pairs across five paralinguistic dimensions. It reveals that current large audio-language model (LALM) judges lag human judgments by 32% on average and are poorly calibrated, especially in tie cases where abstention is correct.

arXiv cs.CL · 1d ago
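
As a rough illustration of why tie cases matter in a pairwise benchmark like the one above, the sketch below scores a judge against human labels, both overall and on the tie subset alone. The label names (`"A"`, `"B"`, `"tie"`) and the function `pairwise_agreement` are assumptions made for this example, not the benchmark's actual protocol or metric.

```python
def pairwise_agreement(human, judge):
    """Return (overall agreement, share of human ties the judge also calls a tie).

    Labels are hypothetical: "A" or "B" when one clip is preferred, "tie" otherwise.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    overall = sum(h == j for h, j in zip(human, judge)) / len(human)
    tie_positions = [i for i, h in enumerate(human) if h == "tie"]
    if not tie_positions:
        return overall, None
    tie_recall = sum(judge[i] == "tie" for i in tie_positions) / len(tie_positions)
    return overall, tie_recall


# Toy data: the judge never abstains, so it misses every human-labelled tie.
human_labels = ["A", "B", "tie", "tie", "A"]
judge_labels = ["A", "B", "A", "B", "B"]
overall, tie_recall = pairwise_agreement(human_labels, judge_labels)
```

A judge that always picks a side can still look acceptable on decisive pairs while scoring zero on ties, which is why reporting the tie subset separately is useful.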

AI-PAVE-Br: LLM-Based PAVE for Brazilian E-Commerce

AI-PAVE-Br uses large language models to enhance product attribute value extraction in Brazilian e-commerce. The system outperforms traditional NER methods, with a new Golden Set dataset providing a manually annotated benchmark for Portuguese product data.

arXiv cs.CL · 1d ago

DREAM: Autoregressive Training for Dense Retrieval Embeddings

DREAM uses autoregressive next-token prediction to supervise dense retrieval embedding training. It injects query-document similarity scores into a frozen LLM's attention heads, enabling gradient backpropagation for retriever optimization. DREAM outperforms baselines on BEIR and RTEB benchmarks across model scales.

arXiv cs.CL · 1d ago

CN-NewsTTS Bench v0.1 Released

CN-NewsTTS Bench v0.1 is an open benchmark for evaluating Chinese news TTS systems' ability to correctly pronounce raw text targets. It includes 200 development and 800 public test records, 992 auto-evaluable targets, and results for seven TTS systems, with the best achieving 0.879 strict accuracy and several below 0.60.

arXiv cs.CL · 1d ago

Task Decomposition for Efficient Annotation

We propose decomposing structured annotation tasks into sub-tasks to reduce overall inferential load. By identifying salient anchor entities—centers in the space of valid annotations—we constrain output complexity and improve cost-efficiency. We provide guidelines for decomposition and a procedure to allocate sub-tasks across human and model annotators for optimal quality under fixed budgets.

arXiv cs.CL · 1d ago

CANDLE: Lightweight Arabic Noise Deduplication via CTC

CANDLE is a lightweight system that uses Connectionist Temporal Classification to deduplicate repeated characters in Arabic text, without relying on handcrafted rules or morphological analyzers. It achieves a Sentence Error Rate of 5.37% and reduces tokenizer fertility by up to 12.8%, lowering inference costs and improving context window usage.

arXiv cs.CL · 1d ago
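
Tokenizer fertility, mentioned in the CANDLE summary above, is commonly taken to mean the average number of tokens produced per word. The sketch below shows how removing repeated characters can lower it. Everything here is an assumption for illustration: `toy_tokenize` is a stand-in that cuts words into fixed-length pieces, and `collapse_repeats` is a naive regex rule, which is exactly the kind of handcrafted rule CANDLE is said to avoid. It is not the CTC-based system.

```python
import re


def toy_tokenize(word, piece_len=3):
    """Hypothetical stand-in for a subword tokenizer: fixed-length character pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def fertility(text):
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(toy_tokenize(w)) for w in words) / len(words)


def collapse_repeats(text):
    """Naive rule (not CANDLE): squeeze every run of one character to a single copy."""
    return re.sub(r"(.)\1+", r"\1", text)


# Two elongated Arabic words, built explicitly so the repeat counts are visible.
noisy = "جم" + "ي" * 4 + "ل" + " " + "جد" + "ا" * 5
clean = collapse_repeats(noisy)

fertility_before = fertility(noisy)
fertility_after = fertility(clean)
```

A blanket rule like this would also damage words with legitimate doubled letters, which is one reason a learned model may be preferable to the regex used here.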

Are We Ready For An Agent-Native Memory System?

A new study decomposes agent memory into four core modules and evaluates 12 systems across five benchmark workloads. It finds no single architecture dominates, with performance dependent on alignment with workload bottlenecks, and reveals that localized maintenance is more cost-efficient than global reorganization.

arXiv cs.CL · 1d ago

L3Cube-MahaPOS: Marathi POS Tagging Dataset and BERT Models

L3Cube-MahaPOS introduces a gold-standard part-of-speech tagging dataset for Marathi, comprising 32,354 manually annotated sentences from news text. It uses a 16-tag Universal Dependencies scheme and benchmarks six model families, with MahaBERT-v2 reaching 88.67% token-level accuracy and 81.67% macro-F1 on 15 tag classes.

arXiv cs.CL · 1d ago
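
The gap between token-level accuracy and macro-F1 in the MahaPOS summary above is typical of tagging tasks with rare classes. The sketch below computes both on toy data to show how they can diverge. The tag sequences are invented, and averaging F1 over the classes present in the gold labels is an assumption, since the paper's exact averaging is not stated here.

```python
from collections import Counter


def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong
            fn[g] += 1  # gold tag g was missed
    scores = []
    for label in sorted(set(gold)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: a tagger that always predicts the frequent class.
gold_tags = ["NOUN"] * 8 + ["INTJ"] * 2
pred_tags = ["NOUN"] * 10

accuracy = token_accuracy(gold_tags, pred_tags)
macro = macro_f1(gold_tags, pred_tags)
```

Accuracy rewards getting the frequent class right, while macro-F1 gives the rare class equal weight, so a tagger that ignores rare tags is penalized much more heavily under macro-F1.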

Quality-Aware Training Data Selection for Scientific Summarization

We construct and release a large biomedical dataset with 1.88 million PMC articles. Analysis shows author-written abstracts vary in quality and alignment with source articles, enabling effective training-data selection. Training on high-quality subsets outperforms random sampling and matches larger random subsets on factuality metrics.

arXiv cs.CL · 1d ago

Linguistic Fingerprints Reveal Tang Poets' Regional Origins

A computational analysis of the Complete Tang Poems shows that poets' geographic origins leave detectable linguistic traces. Models using character n-gram TF-IDF and domain features achieve 0.69 accuracy in predicting broad regional origin (South vs. North), surpassing chance, and correctly classify finer circuit-level origins. The study finds linguistic distance between circuits correlates with geographic distance, with regional divergence increasing in the Late Tang, and highlights historical biases in early Tang poetic style.

arXiv cs.CL · 1d ago
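
The character n-gram TF-IDF features mentioned in the Tang poetry summary above can be sketched in a few lines. This is a minimal illustration only: the bigram size, the `log(N / df) + 1` IDF variant, and the sample strings are assumptions, and the study's actual feature set, domain features, and classifier are not reproduced.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=2):
    """Sparse TF-IDF vectors (dicts) over character n-grams."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    doc_freq = Counter(g for c in counts for g in c)
    total = len(docs)
    vectors = []
    for c in counts:
        length = sum(c.values())
        vectors.append({
            g: (k / length) * (math.log(total / doc_freq[g]) + 1.0)
            for g, k in c.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


# Illustrative strings only; no regional labels are implied.
docs = ["春江花月夜", "春江潮水", "大漠孤烟直"]
v0, v1, v2 = tfidf_vectors(docs)
sim_shared = cosine(v0, v1)    # the two strings share the bigram "春江"
sim_disjoint = cosine(v0, v2)  # no bigram in common
```

In a setup like the one described, vectors of this kind would be fed to a classifier; the similarity comparison here only shows that shared character sequences are what the representation captures.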

First Large-Scale Analysis of Algorithm Co-Occurrence Networks

This study analyzes algorithm influence through co-occurrence networks in natural language processing, using full-text academic papers. It reveals that algorithm networks exhibit complex network features, with denser connections emerging over two decades, and that classic algorithms at research intersections show high centrality and balanced influence. The research provides a temporal and structural view of algorithm evolution and lays groundwork for future studies on algorithm, scholar, and task networks.

arXiv cs.CL · 1d ago

PORTER: Language-Grounded Event Representations for Portable EHR Foundation Models

PORTER introduces a language-grounded structured EHR foundation model that represents clinical events via descriptions instead of fixed vocabularies. It achieves superior performance across 74 pediatric prediction tasks and transfers effectively to new vocabularies without retraining, recovering 97.1% of target AUROC and outperforming fixed-vocabulary models on MIMIC, with 329-fold lower compute than text serialization approaches.

arXiv cs.CL · 1d ago

LoRA Monitor Calibration Fails with Top-1 in Diffusion LMs

Top-1 argmax concentration fails as a collapse warning in LoRA-optimized diffusion language models, showing zero precision across 816 configurations. Max LoRA gradient norm outperforms this baseline, achieving 0.68 precision and 0.79 F1 on a held-out LLaDA split, though results are limited to short-horizon, family-specific inspections.

arXiv cs.CL · 1d ago
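
The summary above reports precision and F1 for the gradient-norm monitor but not recall. Since F1 is the harmonic mean of precision and recall, the implied recall can be recovered by rearranging F1 = 2PR / (P + R) into R = F1 · P / (2P − F1). The sketch below does that arithmetic. It assumes both figures come from the same split and threshold, and because the inputs are rounded to two decimals the result is only approximate.

```python
def recall_from_precision_f1(precision, f1):
    """Solve F1 = 2PR / (P + R) for recall R."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall exists for this precision/F1 pair")
    return f1 * precision / denom


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the summary (rounded), so the recall is an estimate.
implied_recall = recall_from_precision_f1(0.68, 0.79)
round_trip_f1 = f1_score(0.68, implied_recall)
```

With these inputs the implied recall comes out at roughly 0.94, which would mean the monitor catches most collapses at the cost of a fair number of false alarms.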

Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning

HDS introduces a multi-objective reinforcement learning framework for online data mixing in LLM pre-training. It needs 44% fewer training iterations on The Pile and improves 0-shot MMLU performance by 7.2%, with consistent gains across other benchmarks.

arXiv cs.CL · 1d ago

InterAligner: Progressive Alignment for ASR

InterAligner introduces an intermediate aligner objective and InterCTC loss to enable progressive alignment formation in deep ASR models. On LibriSpeech with a 17-layer Conformer, it reduces WER from 5.0/7.8 to 3.1/5.6, with significant improvements on long utterances.

arXiv cs.CL · 1d ago
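
For readers less familiar with the metric in the InterAligner summary above, word error rate (WER) counts word-level substitutions, deletions, and insertions against a reference and divides by the reference length. The sketch below computes WER with a standard edit-distance recurrence and then derives the relative reductions implied by the quoted figures. The example sentences are invented, and the summary does not say which test sets the two figures in each pair refer to.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


wer_example = word_error_rate("the cat sat on the mat", "the cat sat on mat")
first_pair = relative_reduction(5.0, 3.1)   # first figure of each quoted pair
second_pair = relative_reduction(7.8, 5.6)  # second figure of each quoted pair
```

On the quoted numbers this works out to relative WER reductions of 38% and about 28%.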

Metis: Bridging Text and Code Memory for Self-Evolving Agents

Metis introduces a hierarchical dual-representation memory that combines text and code memory to improve self-evolving agents. It organizes experience into execution plans, facts, and pitfalls, crystallizing reusable plans into validated tools only when justified. Evaluated on AppWorld, Metis achieves up to 20.6% higher task accuracy and 22.8% lower execution cost than ReAct, with better overall balance across accuracy, efficiency, and memory cost.

arXiv cs.CL · 1d ago

MedBench v5: Dynamic Benchmark for Clinical AI

MedBench v5 introduces a dynamic, process-oriented benchmark for clinical multimodal models, featuring clinical cognitive responsiveness and atomic skills across 63 tasks. It includes stressors for degradation analysis and monitors hallucination propagation through five reasoning nodes, revealing that strong task performance does not ensure process stability.

arXiv cs.CL · 1d ago

BehaviorBench Launches Benchmark for Behavioral AI Models

BehaviorBench introduces a comprehensive benchmark to evaluate foundation models across four behavioral science capabilities: behavior prediction, strategic decision-making, subject-trait inference, and knowledge application. It assesses models at both individual and distributional levels, revealing that behavioral foundation models like Be.FM-1.5 achieve stronger distributional alignment than general-purpose models, highlighting the need for distributional evaluation in behavioral AI.

arXiv cs.CL · 1d ago

CORE-BREW: LLR-Based Soft Decoding for Robust Multi-Bit LLM Watermarking

CORE-BREW introduces a soft-decision decoding method using calibrated log-likelihood ratios to enable robust multi-bit watermarking in LLMs. It achieves consistent hit rates and improved false-positive control through strict and FPR-calibrated detection modes, outperforming prior baselines under token-level edits and paraphrasing while preserving semantic quality.

arXiv cs.CL · 1d ago
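
The general idea behind soft-decision decoding with log-likelihood ratios (LLRs), as named in the CORE-BREW summary above, can be shown with a toy comparison against hard voting. In the sketch below each observation is a hypothetical `(bit_index, llr)` pair, where a positive LLR favors bit value 1. The function names, the data, and the zero decision threshold are assumptions for illustration. The paper's calibration procedure and detection modes are not reproduced.

```python
def soft_decode(observations, num_bits):
    """Sum LLRs per bit position and decide by the sign of the total."""
    totals = [0.0] * num_bits
    for bit_index, llr in observations:
        totals[bit_index] += llr
    return [1 if total > 0 else 0 for total in totals]


def hard_decode(observations, num_bits):
    """Baseline: threshold each LLR first, then take a majority vote per bit."""
    votes = [0] * num_bits
    for bit_index, llr in observations:
        votes[bit_index] += 1 if llr > 0 else -1
    return [1 if vote > 0 else 0 for vote in votes]


# Toy evidence: bit 0 has one confident vote for 1 and two weak votes for 0.
observations = [(0, 2.5), (0, -0.2), (0, -0.3), (1, -1.0), (1, -0.5)]
soft_bits = soft_decode(observations, num_bits=2)
hard_bits = hard_decode(observations, num_bits=2)
```

Here the two decoders disagree on bit 0: hard voting discards confidence and sides with the two weak observations, while the summed LLRs let the single confident observation dominate. That retention of confidence is the usual argument for soft decoding when some tokens have been edited or paraphrased.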

Pāninian Foundation for Indic Language Processing

A new benchmark suite proposes leveraging Pānini's ancient grammar as a unifying framework for Indic language processing. This approach aims to improve accuracy, data efficiency, and transferability by grounding NLP tools in a shared morphosyntactic architecture. The framework raises questions about whether neural models internally represent Pānini's linguistic categories.