All articles — korshunov.ai


CoT Transformers Can Efficiently Simulate Word RAM Algorithms

Chain-of-thought (CoT) transformers can simulate Word RAM algorithms with only polylogarithmic overhead. The overhead drops to log-squared for flat instruction sets and to logarithmic for multiplication-free ones, in contrast to prior Turing machine simulations, which require quadratic overhead.

arXiv cs.CL · 12d ago

Sentiment Analysis Misses Key Customer Outcomes

A study of 70,450 support conversations found that sentiment analysis captures customer satisfaction poorly: GPT-5.4-based satisfaction estimates correlate 0.47 with ratings, versus 0.36 for sentiment. The model also found that tone and satisfaction diverge in 44% of conversations, exposing 'tolerated friction' (satisfied customers who still report fixable issues) that sentiment analysis misses.

arXiv cs.CL · 12d ago

TerraMARS: Small Language Model Pipeline for Mars Terraforming Literature

TerraMARS is an end-to-end pipeline that uses a domain-adapted small language model to extract structured information from Mars science literature. It converts unstructured text into JSON format and supports Mars terraforming-related question answering, enabling integration into habitability modeling and digital twin applications. The pipeline uses Google Gemma 3 1B fine-tuned with QLoRA on Mars-specific datasets, though further work is needed to improve accuracy and factual consistency.

arXiv cs.CL · 12d ago

NEST: Dataset for Narrative Event Structures in Long Videos

NEST introduces a dataset of 1,005 full-length movies, each annotated with 102 multimodal narrative events grounded in visual, dialogue, and audio content. The dataset captures event relationships such as temporal ordering, hierarchy, and long-range dependencies, with benchmark tasks showing low performance in event detection and localization, and higher performance in event relation extraction after fine-tuning.

arXiv cs.CL · 12d ago

FineREX: Fine-Tuned NER-RE for Human Smuggling Knowledge Graphs

FineREX is a domain-specific knowledge graph pipeline that uses a fine-tuned LLM for named entity and relationship extraction. It outperforms general-purpose models by 15.50% in entity F1-score and 31.46% in relationship F1-score, reducing legal noise by nearly half and node duplication from 17.78% to 11.17%. The system also cuts end-to-end processing time by 50.0% by eliminating redundant steps.

arXiv cs.CL · 12d ago

Introducing P-CHR AUC and CRR for Semantic Caching

We introduce Precision-Cache Hit Ratio (P-CHR) AUC and Calibration Retention Rate (CRR) to address the calibration gap in semantic caching. These metrics evaluate precision across cache utilization levels and measure how offline ranking quality persists in deployment. Our analysis shows the gap is driven by training objectives, not data scale, and post-hoc calibration only partially resolves it.
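
One way to picture a precision-versus-cache-hit-ratio curve and its area is the minimal sketch below. It is an illustration under stated assumptions, not the paper's definition: it assumes each query has a similarity score to its nearest cached entry and a label saying whether serving the cached answer would be correct, sweeps the acceptance threshold from strict to lenient, and integrates precision over the covered hit-ratio range with the trapezoidal rule. The function names and the normalisation are hypothetical, and tied scores are not treated specially.

```python
from typing import List, Tuple


def precision_chr_curve(scores: List[float], correct: List[bool]) -> List[Tuple[float, float]]:
    """Return (cache_hit_ratio, precision) points as the threshold is relaxed.

    Assumption: lowering the threshold admits queries in descending score order.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    points = []
    hits = 0
    correct_hits = 0
    for i in order:
        hits += 1                          # one more query served from cache
        correct_hits += 1 if correct[i] else 0
        points.append((hits / n, correct_hits / hits))
    return points


def p_chr_auc(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under the curve, normalised by the hit-ratio range covered."""
    if not points:
        return 0.0
    if len(points) == 1:
        return points[0][1]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / (points[-1][0] - points[0][0])


if __name__ == "__main__":
    # Toy data, made up for illustration only.
    curve = precision_chr_curve([0.95, 0.90, 0.80, 0.60], [True, True, False, True])
    print(curve)
    print(p_chr_auc(curve))
```

A single-threshold precision number hides how quickly precision decays as more traffic is served from cache; summarising the whole curve is one way to compare caching policies across utilization levels.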

arXiv cs.CL · 12d ago

NRITYAM: Benchmark for Cultural Comprehension in Dance

NRITYAM is a multilingual benchmark with 9,260 question-answer pairs across 12 languages, designed to evaluate language models' cultural understanding of global dance traditions. Developed through collaboration with native dance artists and speakers, it offers a comprehensive assessment of AI's ability to grasp traditional performing arts in diverse socio-cultural contexts.

arXiv cs.CL · 12d ago

Sequential DPO Shows Variable Preference Impact Across Settings

A study of sequential Direct Preference Optimization finds that later training does not uniformly degrade earlier learned preferences. The effect varies by objective relationship, signal strength, and training order, ranging from partial degradation to positive transfer. Pair-level analysis reveals heterogeneous changes, with high-confidence preference pairs sometimes improving despite aggregate metric stability.

arXiv cs.CL · 12d ago

Benchmarking Agentic Review Systems for AI-Assisted Research

A study evaluates four AI review systems across six language models, finding OpenAIReview with GPT-5.5 achieves 83.0% accuracy in matching paper quality to external signals and detects 71.6% of injected errors. Real user feedback shows positive sentiment, with a 1.44-to-1 vote ratio, though false positives and minor nitpicks remain common.

arXiv cs.CL · 12d ago

Bayesian Curriculum Learning on LLM Latent Manifolds

Manifold Bandits introduces Bayesian Manifold Curriculum (BMC), a framework that models problem sampling as a structured bandit problem in LLMs' latent space. BMC organizes tasks into a hierarchical tree and uses Bayesian learning to guide sampling, revealing tradeoffs between learning signal, task diversity, and evaluation relevance. Prioritizing difficulty alone fails to achieve strong downstream performance, underscoring the need for structure and type-aware sampling.
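
To make the idea of Bayesian-guided sampling concrete, here is a minimal sketch using Beta-Bernoulli Thompson sampling over a flat set of task clusters. This is a stand-in technique chosen for illustration; the summary does not say which posterior or bandit rule BMC uses, and the hierarchical tree is not modelled here. The name `thompson_curriculum` and the `signal_prob` values (the chance that a problem drawn from a cluster yields a useful learning signal) are hypothetical.

```python
import random
from typing import List


def thompson_curriculum(signal_prob: List[float], rounds: int, seed: int = 0) -> List[int]:
    """Sample task clusters with Thompson sampling and return per-cluster pull counts.

    signal_prob[i] is a hypothetical, simulated probability that a problem
    from cluster i produces a useful learning signal.
    """
    rng = random.Random(seed)
    k = len(signal_prob)
    alpha = [1.0] * k  # Beta posterior: 1 + observed successes
    beta = [1.0] * k   # Beta posterior: 1 + observed failures
    counts = [0] * k
    for _ in range(rounds):
        # Draw one plausible signal rate per cluster from its posterior.
        draws = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: draws[i])
        # Simulate whether the sampled problem was useful.
        if rng.random() < signal_prob[arm]:
            alpha[arm] += 1.0
        else:
            beta[arm] += 1.0
        counts[arm] += 1
    return counts


if __name__ == "__main__":
    print(thompson_curriculum([0.1, 0.5, 0.9], rounds=500))
```

A sampler like this chases whichever cluster currently looks most rewarding, which is exactly the kind of single-criterion behaviour the summary warns about; balancing diversity and evaluation relevance would need additional structure on top.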

arXiv cs.CL · 12d ago

AgentFinVQA: Auditable, On-Premise Financial Chart QA

AgentFinVQA introduces a multi-agent pipeline for financial chart question answering that ensures auditability and on-premise deployability without significant accuracy loss. It outperforms baseline models by +7.68 pp using a proprietary backbone and +4.84 pp with open-weights Qwen3.6-27B-FP8, while providing a confidence signal via verifier output that improves human review routing.

arXiv cs.CL · 12d ago

CombEval: Benchmark for Combinatorial Counting in LLMs

CombEval is a dynamic benchmark that generates natural-language counting problems with verified answers using typed Cofola specifications. It evaluates 11 large language models and reveals persistent failures in handling ordered objects, indistinguishable elements, positional constraints, and nested dependencies, with errors rooted in constraint interpretation and counting principles.

arXiv cs.CL · 12d ago

Selective Verification for Budget-Aware Reasoning

Sevra, a serving-layer controller, selectively verifies answers to improve accuracy and reduce token usage. On its MATH-based benchmark (\mathfive), it achieves 76.3% accuracy with 26.8% fewer post-generation tokens and half as many harmful flips. On its GSM-based benchmark (\gsm), it verifies only 3.0% of examples, raising accuracy to 94.5% and cutting verification tokens by 91.2%. The study shows that initial solve length and the need for explicit control determine the optimal verification strategy.
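
The general shape of a budget-aware selective verifier can be sketched as follows. This is a hypothetical illustration, not Sevra's actual policy: it assumes the controller sees each draft answer with its solve length in tokens, verifies only drafts at or above a length threshold, and stops verifying once a fixed token budget is spent. `Draft`, `run_controller`, and all parameters are names invented for this sketch, and `verify` is a caller-supplied stub.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Draft:
    answer: str        # the model's initial answer
    solve_tokens: int  # tokens spent producing it


def should_verify(draft: Draft, length_threshold: int, budget_left: int, verify_cost: int) -> bool:
    """Hypothetical rule: verify long solves only, and only if budget allows."""
    return draft.solve_tokens >= length_threshold and budget_left >= verify_cost


def run_controller(
    drafts: List[Draft],
    verify: Callable[[Draft], str],
    length_threshold: int,
    token_budget: int,
    verify_cost: int,
) -> Tuple[List[str], int]:
    """Return final answers and the number of drafts that were verified."""
    finals: List[str] = []
    verified = 0
    budget_left = token_budget
    for draft in drafts:
        if should_verify(draft, length_threshold, budget_left, verify_cost):
            finals.append(verify(draft))  # verifier may keep or revise the answer
            budget_left -= verify_cost
            verified += 1
        else:
            finals.append(draft.answer)   # skip verification, keep the draft
    return finals, verified


if __name__ == "__main__":
    drafts = [Draft("4", 50), Draft("7", 400), Draft("9", 500)]
    # Stub verifier that just marks the answer as checked.
    print(run_controller(drafts, lambda d: d.answer + " (checked)", 300, 150, 100))
```

Skipping verification on short solves saves tokens and also avoids giving the verifier a chance to flip an already-correct answer, which is the trade-off the summary's "harmful flips" figure points at.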

arXiv cs.CL · 12d ago

Semantic Clusters Pre-Train Tsetlin Machine for Interpretability

A new framework pre-trains the Tsetlin Machine using semantic clusters from language models, avoiding embeddings. The method groups text samples into coherent clusters via K-means or Top2Vec, then uses cluster-sample pairs to train a non-negated TM with Type I feedback. Results show superior performance across five datasets, matching BERT-level accuracy while maintaining full interpretability.

arXiv cs.CL · 12d ago

Credence: Semantic Metrics and Convergence Analysis for Claim Decomposition

Credence introduces Semantic-F1, a BGE-large cosine similarity metric that improves claim decomposition accuracy over Jaccard by 15-32 percentage points. It establishes convergence theorems for rule- and LLM-based repair, showing rule-based repair is finitely terminating and monotone, while LLM-based repair requires early-exit guards. Evaluations across social-media, encyclopaedic, and news domains show EPR from 0.94 to 1.00, with rule-repair reducing atomicity violations by 47-100% without fidelity loss.
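
The contrast between a token-overlap metric and an embedding-similarity metric can be sketched as below. This is a minimal illustration under assumptions, not Credence's implementation: claims are taken to be already embedded as plain float vectors (toy vectors stand in for BGE-large embeddings), a predicted claim counts as matched when its best cosine similarity to any reference claim reaches a threshold, and F1 is the harmonic mean of the resulting precision and recall. The matching rule, the threshold value, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard overlap between two claims."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def _matched_fraction(src: List[Sequence[float]], tgt: List[Sequence[float]], threshold: float) -> float:
    """Fraction of src vectors whose best match in tgt meets the threshold."""
    if not src or not tgt:
        return 0.0
    matched = sum(1 for s in src if max(cosine(s, t) for t in tgt) >= threshold)
    return matched / len(src)


def semantic_f1(pred: List[Sequence[float]], ref: List[Sequence[float]], threshold: float = 0.8) -> float:
    """Harmonic mean of embedding-matched precision and recall (assumed definition)."""
    precision = _matched_fraction(pred, ref, threshold)
    recall = _matched_fraction(ref, pred, threshold)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(jaccard("the cat sat", "a cat sat down"))
    # Toy 2-d vectors standing in for real sentence embeddings.
    print(semantic_f1([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [1.0, 1.0]]))
```

Jaccard penalises paraphrases that share few tokens, while an embedding-based match can credit them, which is the kind of gap the reported improvement over Jaccard reflects.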

arXiv cs.CL · 12d ago

JAMER: Project-Level Code Framework Dataset and Benchmark

JAMER introduces JamSet and JamBench, the first project-level game code dataset and benchmark on a professional game engine. Built from 8,133 verified Game Jam projects, it enables deterministic evaluation and reveals a capability cliff in AI models as project scale increases, with runtime pass rates dropping from 80.4% to 5.7%.

arXiv cs.CL · 12d ago

Control-Window Law for Single-Neuron Steering in Language Models

A new framework defines when single-neuron interventions coherently control model behaviors without output collapse. The control window, based on alignment and norm ratios, predicts behavior triggers and collapse ceilings using forward pass data, with high accuracy on held-out neurons. On refusal, control is typed: coherent bypass occurs without actionable content, while genuine actionable reach appears only in specific cases and at later rollout stages.

arXiv cs.CL · 12d ago

AtomMem: Simple and Effective Memory System for LLM Agents

AtomMem introduces a memory system that stores high-value atomic facts from long-form interactions. It uses hierarchical event structures and temporal profiles to capture coherent episodic contexts and track evolving user attributes, enabling stable and efficient memory evolution. Experiments on the LoCoMo benchmark show AtomMem achieves state-of-the-art performance in reasoning tasks.

arXiv cs.CL · 12d ago

Zero-Shot Agentic LLMs Extract Lung Pathology from Narratives

A zero-shot agentic workflow using open-source LLMs extracts 13 College of American Pathologists synoptic fields from lung resection pathology reports. The best model (GPT-OSS-20B) achieved a Micro-F1 of 0.893, outperforming baseline recall and accurately capturing complex pathologic relations without task-specific training.

arXiv cs.CL · 12d ago

LLMs Can Process Non-Readable Text with High Semantic Fidelity

Large language models can maintain 99.5% semantic fidelity when processing compact, non-human-readable text forms called BabelTele, even when the text is reduced to 27.9% of its original length. These model-centric representations show strong performance in cross-model transfer, agent memory, and multi-agent communication, suggesting that human readability is not essential for semantic recovery in LLMs.