Reasoning models — korshunov.ai

PEC-Home: Simulated Dataset for Elliptical Command Interpretation

PEC-Home is the first simulated dataset designed to enable smart home assistants to interpret progressively elliptical commands. Experiments show that even with dialogue history tools, LLMs like GPT-4o fail to achieve accurate command execution from elliptical inputs, highlighting a significant gap in current assistant capabilities.

arXiv cs.CL · 8d ago

EARS Framework Enhances Multi-Agent System Reliability

EARS introduces explanatory abstention in sub-agents to improve reliability in large-scale multi-agent systems. By providing actionable failure rationales to coordinators, EARS increases the overall response pass rate from 68.5% to 78.9% in a production e-commerce assistant.

arXiv cs.CL · 8d ago
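To make the idea of explanatory abstention concrete, here is a minimal sketch of a coordinator that collects failure rationales from sub-agents instead of silent failures. The summary does not describe the actual EARS protocol, so every name here (`SubAgentResult`, `coordinate`, the two toy agents) is a hypothetical illustration, not the paper's design.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SubAgentResult:
    """Either an answer, or an abstention with a rationale (hypothetical type)."""
    answer: Optional[str] = None
    abstention_reason: Optional[str] = None

    @property
    def abstained(self) -> bool:
        return self.answer is None


def order_agent(query: str) -> SubAgentResult:
    # Toy sub-agent: only handles queries that mention an order.
    if "order" in query.lower():
        return SubAgentResult(answer="Order status: shipped")
    return SubAgentResult(abstention_reason="No order reference found in the query")


def refund_agent(query: str) -> SubAgentResult:
    # Toy sub-agent: only handles queries that mention a refund.
    if "refund" in query.lower():
        return SubAgentResult(answer="Refund issued")
    return SubAgentResult(abstention_reason="Query is not about refunds")


def coordinate(
    query: str, sub_agents: List[Callable[[str], SubAgentResult]]
) -> Tuple[Optional[str], List[str]]:
    """Try sub-agents in order; keep each abstention rationale for the coordinator."""
    reasons: List[str] = []
    for agent in sub_agents:
        result = agent(query)
        if not result.abstained:
            return result.answer, reasons
        reasons.append(result.abstention_reason or "unspecified")
    return None, reasons


answer, reasons = coordinate("Where is my refund?", [order_agent, refund_agent])
```

In this sketch the coordinator simply falls through to the next sub-agent; a real system could use the collected rationales to rewrite the query or pick a different route.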

ForecastBench-Sim: Simulated World Forecasting Benchmark

ForecastBench-Sim is a simulated-world forecasting benchmark using Freeciv game rollouts. It enables continuous or binary forecasts at arbitrary horizons, with intervention worlds for causal questions and rare outcomes, and provides immediate, resolvable feedback for evaluating probabilistic reasoning in dynamic environments.

arXiv cs.CL · 8d ago

Frustrated Synchronization Network Outperforms Transformers

The Frustrated Synchronization Network (FSN) achieves lower validation loss than a RoPE-SwiGLU transformer at every epoch on character-level text and code tasks. At one million parameters, FSN converges to a validation loss of 1.5953 ± 0.0014, outperforming the transformer's converged loss of 1.611. This advantage persists up to four million parameters, with ongoing evaluations beyond that scale.

arXiv cs.CL · 8d ago

TW-LegalBench: Evaluating LLMs on Taiwanese Law

TW-LegalBench introduces a benchmark using Taiwan's public legal corpus to assess large language models' performance in Taiwanese law. It includes 16,000+ multiple-choice questions, 117 open-ended essay questions with scoring rubrics, and 14,000+ judgment prediction instances. Evaluation shows top models exceed lawyer passing thresholds (11% pass rate) but fall short of judge/prosecutor levels (1-2% pass rate), and struggle with precise legal article citations in sentencing predictions.

arXiv cs.CL · 8d ago

LLMs Struggle to Capture Item Discrimination in Reading Assessments

A study finds that large language models fail to reliably measure item discrimination in reading comprehension assessments. Some models show weak alignment with human-calibrated scores (ranging from 0.152 to 0.241), but current LLMs do not adequately capture how assessment items distinguish students of different proficiency levels.

arXiv cs.CL · 8d ago

Output Vector Editing Reduces Memorization in LLMs

A new method called output vector editing minimally modifies MLP neurons' output vectors to suppress memorized sequences in large language models, achieving up to 87.9% suppression in OLMo-7B. This approach outperforms zeroing neuron activations by a factor of 2.7 and works across four models from 36-7B parameters, with success rates scaling with model size and showing consistent performance across architectures.

arXiv cs.CL · 8d ago

SAMA: Unified Framework for Low-Resource Multimodal Data Augmentation

SAMA introduces a unified framework that generates high-fidelity, task-aware synthetic data by aligning semantic anchors across modalities. It uses a Collaborative Multi-Experts Multimodal Large Language Model with shared and task-specific adapters, and employs an Anchor-Preserving Diffusion mechanism for image synthesis, ensuring semantic consistency while diversifying visual contexts. Extensive experiments show SAMA outperforms state-of-the-art methods in MNER, MRE, and MEE under low-resource conditions.

arXiv cs.CL · 8d ago

DICE Improves Long-Document Retrieval with Chunk Evidence Aggregation

DICE, a training-free method, splits long documents into chunks, encodes them independently, and aggregates the results into a single vector. It reduces the Evidence Dilution Index in 92.8% of cases on LongEmbed, significantly improving retrieval performance for slices over 4k tokens across four backbones.

arXiv cs.CL · 8d ago
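The split, encode, aggregate pipeline described for DICE can be sketched roughly as follows. The summary does not say which encoder or aggregation rule DICE uses, so this sketch assumes a toy hashed bag-of-words encoder (`toy_encode`) and plain mean pooling followed by L2 normalization; both are stand-ins, not the paper's method.

```python
import hashlib
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def toy_encode(text: str, dim: int = 16) -> List[float]:
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return l2_normalize(vec)


def split_into_chunks(text: str, chunk_size: int = 8) -> List[str]:
    """Split on whitespace into consecutive chunks of at most chunk_size tokens."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]


def encode_long_document(text: str, chunk_size: int = 8, dim: int = 16) -> List[float]:
    """Encode each chunk independently, then aggregate into a single vector.

    Aggregation here is mean pooling (an assumption for illustration).
    """
    chunks = split_into_chunks(text, chunk_size)
    if not chunks:
        return [0.0] * dim
    chunk_vecs = [toy_encode(chunk, dim) for chunk in chunks]
    mean = [sum(v[i] for v in chunk_vecs) / len(chunk_vecs) for i in range(dim)]
    return l2_normalize(mean)


doc_vector = encode_long_document("retrieval over long documents " * 20)
```

Because each chunk is normalized before pooling, a short passage with the relevant evidence contributes as much to the final vector as a long run of unrelated text, which is the intuition behind reducing dilution.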

RedactionBench: A Benchmark for Contextual Privacy in AI

RedactionBench introduces a manually annotated benchmark of 200 diverse documents across 11 domains to evaluate privacy-preserving redaction. It features R-Score, a character-level metric that treats semantically similar redactions equally and reduces bias from formatting choices. Human evaluations reveal significant disagreement on contextual redactions (47.7% consensus), highlighting the subjective nature of privacy and motivating the need for standardized, context-aware benchmarks.

arXiv cs.CL · 8d ago

HandwritingAgent: Language-Driven Handwriting Synthesis in SVG

HandwritingAgent synthesizes natural handwriting in SVG format without style-specific training. It uses a large reasoning model to generate stroke sequences in a grid canvas, conditioned on text input and a reference style image, enabling efficient, controllable, and generalizable handwriting generation.

arXiv cs.CL · 8d ago

LLM-based Metrics Improve Clinical Significance Evaluation in Radiology

A study introduces lightweight, interpretable metrics that sharpen the boundary between clinically significant errors and harmless variations in radiology reports. These metrics outperform large medical LLMs and rival proprietary models, with one-pass training proven effective for cost-sensitive deployment. The two-pass setting fails to consistently improve performance and shifts focus from error detection to robustness.

arXiv cs.CL · 8d ago

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric approach improves long-context reasoning in large language models, using eight curated datasets with 14K examples across retrieval, multi-evidence synthesis, and reasoning tasks. When paired with minimal outcome-based GRPO training, it achieves average gains of +7.2 to +6.4 points on seven benchmarks, outperforming prior RL training sets, and enhances agentic performance by +4.8 and +7.0 points on GAIA and BrowseComp respectively.

arXiv cs.CL · 8d ago

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning

ScholarSum introduces a hierarchical knowledge graph framework that emulates a student-teacher process for scientific summarization. It generates fluent, factually consistent summaries by first structuring documents into semantic units, then refining drafts through evidence retrieval and iterative review by a teacher-like component. Experiments show ScholarSum outperforms existing methods in completeness and factual faithfulness.

arXiv cs.CL · 8d ago

ImpSH Improves Implicit Hate Speech Detection Across Domains

ImpSH, a triplet-based framework, aligns posts with implied statements and uses context-bounded semi-hard negatives to enhance detection of implicit hate speech. Evaluations on IHC, SBIC, and DynaHate show ImpSH outperforms standard supervised contrastive methods in cross-domain settings, with improved representation stability and reduced false negatives under domain shifts.

arXiv cs.CL · 8d ago
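For readers unfamiliar with semi-hard negatives in triplet training, here is a minimal sketch of the standard definition: a negative is semi-hard when it is farther from the anchor than the positive, but still inside the margin. The "context-bounded" selection that ImpSH adds is not specified in the summary and is not modeled here; the function names and the margin value are illustrative assumptions.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def semi_hard_negatives(
    anchor: Sequence[float],
    positive: Sequence[float],
    candidates: List[Sequence[float]],
    margin: float = 0.5,
) -> List[Sequence[float]]:
    """Keep negatives n with d(a, p) < d(a, n) < d(a, p) + margin."""
    d_pos = euclidean(anchor, positive)
    return [n for n in candidates if d_pos < euclidean(anchor, n) < d_pos + margin]


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.5,
) -> float:
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings: the anchor stands for a post, the positive for its implied statement.
anchor = [0.0, 0.0]
positive = [1.0, 0.0]
candidates = [[0.5, 0.0], [1.2, 0.0], [3.0, 0.0]]  # hard, semi-hard, easy
selected = semi_hard_negatives(anchor, positive, candidates)
```

Semi-hard negatives still produce a non-zero loss, so they give a training signal, while avoiding the hardest negatives that can destabilize training.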

Approximate Structured Diffusion for Sequence Labelling

A new method uses diffusion to train CRFs on entire label sequences, conditioning on noisy labels. When combined with approximate inference, it reduces POS-tagging error by 16.5%.

arXiv cs.CL · 8d ago

Distillation with Synthetic Data for Financial Sentiment Analysis

A framework transfers knowledge from large instruction-tuned models to compact ones using synthetic data generated via structured few-shot prompting. Clustering-based seed selection produces more representative synthetic examples than random sampling, enabling compact models to achieve strong performance with minimal human labeling. On complex, noisy financial text, the student model outperforms the teacher model, while remaining competitive on formal text.

arXiv cs.CL · 8d ago

Rubric-Guided Counterfactual Recommendations for Medical Communication

A new pipeline uses language models to recommend minimal, interpretable changes to patient-doctor communication features like tone and personalization. These changes increase predicted positive feedback by an average of 6.41% and are non-negative for 93.31% of cases, without altering medical content.

arXiv cs.CL · 8d ago

RPCL Improves Multimodal Emotion-Cause Pair Extraction

RPCL, a training-only framework, enhances pair confidence in multimodal emotion-cause pair extraction by enforcing discriminative and stable confidence margins. It outperforms a base model on ECF, MECAD, and MEC4 by 2.58 to 2.83 percentage points in Pair F1 and improves mean Pair AUPRC across datasets, with stronger separation between gold pairs and hard negatives.

arXiv cs.CL · 8d ago

REVES: Augmented Training for Test-Time Scaling

REVES introduces a two-stage iterative framework that enhances large language model reasoning through sequential revision and verification. It achieves +6.5 points over RL baselines and +4.0 points over standard multi-turn training on LiveCodeBench, using a 4B base model with fewer rollouts than larger systems. The method improves error correction and generalizes to out-of-distribution puzzles like n_queens and mini_sudoku.