Evaluation & benchmarks — korshunov.ai

Aspect-Based Sentiment Evolution in Multi-Round Peer Reviews

A deep learning study analyzes how sentiment evolves across review rounds in 11,063 Nature Communications papers. As review rounds progress, positive sentiment rises and negative sentiment declines. Aspect-level sentiment is negatively correlated with the total number of rounds, particularly for 'experiments', 'research significance', and 'result analysis'.

arXiv cs.CL · 1d ago

ReCARE: Robust erasure for co-occurring retained concepts in diffusion unlearning

ReCARE introduces a framework that preserves benign co-occurring concepts during unlearning by defining CARE (Co-occurring Associated REtained concepts) and using a CARE score to quantify their retention. It automatically constructs a CARE-set from target images and integrates it into training to ensure stable unlearning while erasing only the target concept.

arXiv cs.CL · 1d ago

Dialogue to Discovery: Attribute-Aware Preference Elicitation

Dialogue to Discovery (D2D) is an attribute-oriented framework that improves conversational product search by dynamically guiding user interactions. It adapts query priorities and recommendation timing, achieving 22.2-29.9% higher target-finding accuracy, 6.6-16.1% lower abandonment, and 27.5% shorter conversations compared to existing methods, with user studies confirming improved satisfaction and efficiency.

arXiv cs.CL · 1d ago

Decoherence as Defence in Quantum Neural Networks for Intrusion Detection

A rigorous N-qubit theory proves that depolarising noise in stochastic quantum neural networks contracts Pauli read-outs exponentially, enabling robust anomaly detection. On the NSL-KDD dataset, such noise yields significant adversarial resilience without catastrophic collapse: noisy models outperform noiseless models and classical detectors under FGSM and PGD attacks, with lower robustness variance and a train-test gap reduced by approximately 0.01.

arXiv cs.CL · 1d ago
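
The exponential contraction mentioned in the quantum neural network entry above can be illustrated in its simplest setting. The following is only a sketch: it assumes a global depolarising channel of strength p applied after each of L circuit layers. The symbols p and L and the channel form are assumptions made for illustration, and the paper's stochastic N-qubit model may differ.

```latex
\begin{align}
\mathcal{D}_p(\rho) &= (1-p)\,\rho + p\,\frac{I}{2^{N}}, \qquad 0 \le p \le 1, \\
% For any traceless Pauli string P, the identity term contributes nothing:
\operatorname{Tr}\!\big[P\,\mathcal{D}_p(\rho)\big] &= (1-p)\,\operatorname{Tr}[P\rho], \\
% \mathcal{D}_p commutes with unitary conjugation, so L noisy layers give
\langle P \rangle_{\text{noisy}} &= (1-p)^{L}\,\langle P \rangle_{\text{ideal}} .
\end{align}
```

Under these assumptions every Pauli read-out shrinks by the same factor (1-p)^L, so the noise also bounds how far an input perturbation can move the read-out.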

CALIBER: Calibrating Confidence Before and After Reasoning in Language Models

CALIBER introduces a method that elicits and supervises confidence estimates at two stages: before and after reasoning. It reduces Expected Calibration Error by 52.5% on BigMathDigits for a 7B model, achieving the best Brier score and AUROC, and performs best on out-of-distribution benchmarks like GPQA and TriviaQA.

arXiv cs.CL · 1d ago
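
The CALIBER entry above reports Expected Calibration Error (ECE) and Brier score. As a reminder of what those two metrics measure, here is a minimal sketch using the common equal-width binning form of ECE. The function names, the bin count, and the toy data are assumptions for illustration; the paper's exact evaluation protocol is not given in the summary.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and 0/1 correctness."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, o))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical toy data: stated confidence and whether the answer was correct.
confidences = [0.9, 0.9, 0.6, 0.6]
outcomes = [1, 0, 1, 1]

print(f"ECE:   {expected_calibration_error(confidences, outcomes):.3f}")
print(f"Brier: {brier_score(confidences, outcomes):.3f}")
```

In this toy case the model is overconfident at 0.9 (half of those answers are wrong) and underconfident at 0.6 (all of them are right), and both gaps contribute equally to the ECE.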

SURGELLM: Task-Aware Feature Gating with Class-Balanced Normalization

SURGELLM introduces a unified transformer framework with surgical feature gating, task-conditioned prefix tokens, and Instance-Weighted Normalization (IWN) to address inductive bias mismatches, class imbalance, and lack of lexical knowledge integration. The IWN variant achieves a macro-F1 of 0.940 across four tasks, outperforming baselines by 0.036 overall and by 0.130 on authorship detection, with gains confirmed as lexical rather than parametric.

arXiv cs.CL · 1d ago

Bad Prompts Cause Model Collapse and Mistakes

Bad contexts in conversations can lead to 'pigeonholing', where models repeat incorrect answers or narrow down to a single response. Experiments show performance drops of 38-40% and worsening errors with more conversation turns, even when initial inputs are correct. A new method, RLVR with synthetic errors, improves model performance by 43-60% under such bad contexts.

arXiv cs.CL · 1d ago

AVOC: Retrieval-Inspired Token Compression for Long-Form Audio-Video Understanding

AVOC enhances long-form audio-video understanding in omni-modal LLMs by introducing a learnable token compression module. It reframes token selection as a top-K retrieval problem, using relevance, importance, and diversity criteria to select compact, informative tokens, achieving state-of-the-art results on OmniVideoBench and LVOmniBench, and maintaining strong performance on one-hour audio-video needle-in-a-haystack tasks.

arXiv cs.CL · 1d ago

Transformer Models: Architectures, Applications, and Critical Assessment

This review presents a taxonomy of transformer-based language models across domain verticals, covering encoder-only, decoder-only, encoder-decoder, long-context, permutation-based, and generator-discriminator variants. It evaluates post-2023 advancements like instruction tuning and mixture-of-experts scaling, and assesses model deployments in healthcare, finance, legal, education, customer service, creative writing, and scientific work, linking each to specific capabilities. The paper critically analyzes model architectures on four key deployment axes, quantifies parameter count versus energy cost, and examines how alignment methods, data provenance, and benchmark saturation define 'state of the art'.

arXiv cs.CL · 1d ago

PETRA: Dataset and Pipeline for Petroleum Engineering Text Adaptation

PETRA transforms public web text into a curated petroleum engineering corpus with synthetic supervision for dense retrieval and reranking. It improves in-domain nDCG from 0.703 to 0.763 and boosts Earth Science benchmark performance by 44% and a six-task reasoning panel by 23%.

arXiv cs.CL · 1d ago
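
The PETRA entry above quotes nDCG values. For readers unfamiliar with the metric, here is a minimal sketch of nDCG with linear gains, plus the relative improvement implied by the two reported numbers. The function names and the toy relevance labels are assumptions; the summary does not state the paper's rank cutoff or gain function.

```python
from math import log2


def dcg(relevances):
    """Discounted cumulative gain with linear gains, ranks starting at 1."""
    return sum(rel / log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance labels in the order a retriever returned them.
ranking = [3, 2, 0, 1]
print(f"nDCG of toy ranking: {ndcg(ranking):.4f}")

# Relative gain implied by the reported in-domain nDCG values (0.703 to 0.763).
relative_gain = (0.763 - 0.703) / 0.703
print(f"Relative nDCG gain: {relative_gain:.1%}")
```

The reported in-domain change therefore corresponds to a relative gain of roughly 8.5%.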

MorfFlex: Managing Rich Morphology in Czech

MorfFlex is a morphological dictionary architecture designed for languages with complex inflection and derivation. MorfFlex CZ, its primary implementation, contains over 100 million wordforms and more than 1 million lemmas, reduced through encoded inflectional and derivational patterns. It supports annotation consistency in the Prague Dependency Treebanks and powers tools like MorphoDiTa.

arXiv cs.CL · 1d ago

ComputeFHE: A Privacy-Preserving General-Purpose Computation Library

ComputeFHE is an open-source C++ library that enables privacy-preserving computation using the TFHE cryptosystem. It offers encrypted integer and fixed-point data types with arithmetic and logical operations, supporting both standard and optimized FHE-friendly ALU architectures. Experimental results show up to 3.9x performance improvements and reduced bootstrapping operations, with a simulation mode for testing and complexity analysis without cryptographic execution.

arXiv cs.CL · 1d ago

Stability of Prompt Ranking in LLM Evaluation

Prompt rankings in large language model evaluation are often unstable under minor variations like random seeds and limited subsets. A stability-aware selection strategy using lower confidence bounds improves robustness by accounting for both performance and variance, while maintaining competitiveness in stable settings.

arXiv cs.CL · 1d ago
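
The prompt-ranking entry above describes selecting prompts by a lower confidence bound rather than by mean score alone. The sketch below shows one plausible form of that idea: mean minus k standard errors across repeated runs. The exact bound used in the paper is not given in the summary, so the formula, the parameter `k`, and the score data are all assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def lower_confidence_bound(scores, k=1.0):
    """Mean score minus k standard errors (assumed form of the bound)."""
    return mean(scores) - k * stdev(scores) / sqrt(len(scores))


def select_prompt(runs, k=1.0):
    """Pick the prompt whose lower confidence bound is highest."""
    return max(runs, key=lambda name: lower_confidence_bound(runs[name], k))


# Hypothetical accuracy of two prompts across four seeds or subsets.
runs = {
    "prompt_a": [0.80, 0.60, 0.95, 0.55],  # higher mean, unstable
    "prompt_b": [0.71, 0.70, 0.72, 0.71],  # slightly lower mean, stable
}

best_by_mean = max(runs, key=lambda name: mean(runs[name]))
best_by_lcb = select_prompt(runs)
print("best by mean:", best_by_mean)
print("best by LCB: ", best_by_lcb)
```

With `k=0` the rule reduces to ranking by mean, which is why such a strategy can stay competitive when rankings are already stable.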

AutoSpecNER: Fine-Grained NER Dataset for Vehicle Specifications

AutoSpecNER is a dataset of 659 car advertisements with over 10,000 entities annotated across 15 categories. It achieves 91.5% inter-annotator agreement and shows that DeBERTa outperforms both rule-based methods and large language models in vehicle specification extraction, reaching a 90% micro-F1 score.

arXiv cs.CL · 1d ago

Age of LLM: Benchmark for LLM Reasoning and Diplomacy

Age of LLM introduces a turn-based 1v1 benchmark where two LLMs compete on a 13x7 grid under fog of war, full diplomacy, and strict JSON reliability rules. Findings show the nuclear rush dominates, diplomacy is prolific but rarely succeeds, and illegal actions reveal belief-tracking errors, with a weak link between reliability and victory. The corpus is small and unbalanced, and the results offer a preliminary view of LLM reasoning under adversarial uncertainty.

arXiv cs.CL · 1d ago

ExtractConf: Multi-Signal Confidence Engine for LLM Document Extraction

ExtractConf introduces a confidence engine that uses two LLM readings of each document, one field-guided and one document-guided, to detect unreliable extractions. It fuses disagreement between the two calls, LLM uncertainty, and document signals into a classifier, achieving 0.928 ROC AUC on invoices and reducing selective prediction risk by 70%.

arXiv cs.CL · 1d ago
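
The ExtractConf entry above reports a reduction in selective prediction risk. As an illustration of that metric only (not of the paper's classifier), the sketch below computes the error rate among the most confident predictions kept at a given coverage. The function name, the rounding rule for the kept count, and the toy data are assumptions.

```python
def selective_risk(confidences, correct, coverage):
    """Error rate among the top `coverage` fraction of predictions by confidence."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    kept = max(1, round(coverage * len(ranked)))  # always keep at least one item
    errors = sum(1 for _, ok in ranked[:kept] if not ok)
    return errors / kept


# Hypothetical extracted fields: confidence score and whether the value was right.
confidences = [0.95, 0.90, 0.80, 0.60, 0.40]
correct = [True, True, True, False, False]

for coverage in (1.0, 0.6):
    print(f"coverage {coverage:.0%}: risk {selective_risk(confidences, correct, coverage):.2f}")
```

A well-ranked confidence score lets the system abstain on the low-confidence tail, so risk falls as coverage is reduced.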

EDV Framework Enables Reliable Experience Learning for Agentic Systems

The EDV framework introduces an Execute-Distill-Verify paradigm to overcome the self-confirmation trap in large language model agents. By using multiple agents to explore tasks, a third-party agent to distill experiences, and a consensus-based verification step, EDV ensures only accurate experiences are stored in memory. Evaluation on tau2-bench, Mind2Web, and MMTB shows EDV outperforms strong baselines, demonstrating its effectiveness in enabling robust agent self-evolution.

arXiv cs.CL · 1d ago

LLM-based Two-Stage Transformer for Bearing Fault Diagnosis

A lightweight GPT-2-style Transformer enables hierarchical feature extraction from vibration signals. The framework achieves 92.61% average accuracy using only 10% labeled data, outperforming state-of-the-art methods by 17.24 percentage points in cross-domain bearing fault diagnosis.

arXiv cs.CL · 1d ago

African Language Tokenization Penalty in Frontier LLMs

African languages face a tokenization premium of 1.88x to 8.92x compared to English in frontier LLMs, with Ethiopic and N'Ko scripts bearing the highest costs. This penalty translates to up to 8.9x higher inference costs and reduced context capacity, with some languages receiving as little as 11% of English's effective context window. The penalty persists across corpora and is not eliminated by current tokenizers, highlighting a structural digital divide.

arXiv cs.CL · 1d ago
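
The figures in the tokenization entry above are linked by simple arithmetic. The sketch below assumes that the premium is the ratio of token counts for the same content, that cost is billed per token, and that effective context is the reciprocal of the premium. These definitions and the function names are assumptions, chosen because they reproduce the 11% figure from the 8.92x premium.

```python
def tokenization_premium(tokens_target, tokens_english):
    """Tokens needed in the target language per English token, same content."""
    return tokens_target / tokens_english


def effective_context_fraction(premium):
    """Share of an English-equivalent context window left after the premium."""
    return 1.0 / premium


# The two ends of the range reported in the summary.
for premium in (1.88, 8.92):
    fraction = effective_context_fraction(premium)
    print(f"premium {premium:.2f}x -> cost x{premium:.2f}, "
          f"effective context {fraction:.0%} of English")
```

Under these assumptions, the least-penalized language in the reported range keeps roughly 53% of the English-equivalent context and the most-penalized keeps roughly 11%.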

UOL@IDEM Submits L1-Aware Vocabulary Prediction Model

UOL@IDEM presents a closed-track submission to BEA 2026, modeling vocabulary difficulty prediction as regression for Spanish, German, and Chinese. The system integrates multilingual contextual embeddings with engineered features like frequency and cognate similarity, achieving lower RMSE scores than baselines, with feature analysis highlighting frequency as the most stable predictor and contextual predictability as a key L1-sensitive signal.