All articles — korshunov.ai

REAR: Test-time Preference Realignment through Reward Decomposition

The authors introduce REAR, a novel framework that extends test-time scaling (TTS) to preference alignment by modeling the task as a realignment problem. This approach addresses the limitation of existing TTS methods, which are typically restricted to verifiable domains like mathematics and coding.

arxiv arXiv cs.CL · 10h ago

OLIVE: View-Augmented Latent Prediction with Waveform Reconstruction for Speech SSL

The authors propose OLIVE, a self-supervised speech representation learning framework that jointly optimizes analysis and synthesis objectives through view-augmented masked latent prediction and waveform reconstruction. This unified approach constrains early encoder features to retain signal-level information while shaping later contextual representations toward invariance for robust downstream performance.

arxiv arXiv cs.CL · 10h ago

MaDI-Bench: An End-to-End Data Integration Benchmark

The Mannheim Data Integration Benchmark (MaDI-Bench) is introduced as the first public benchmark for the end-to-end integration of relational tables, addressing the lack of comprehensive evaluation tools in the field. It covers all steps of the integration process, including schema matching, value normalization, entity blocking, entity matching, and data fusion.

arxiv arXiv cs.CL · 10h ago

Uncovering Salience-Driven Dynamics in Consumer Confidence with Generative Social Simulation

This article introduces ConsumerSim, a generative framework that reconstructs Consumer Confidence Index (CCI) dynamics using a microdata-calibrated synthetic population and various economic signals. The model outperforms all baselines in reconstruction accuracy across U.S., EU27, and Japanese CCI series, particularly during high-salience shocks.

arxiv arXiv cs.CL · 10h ago

MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training

The authors propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm designed to integrate the capabilities of multiple domain-specific reinforcement learning teachers into a single student model. This approach eliminates exposure bias and provides a dense optimization signal by distilling teachers into the student during its own rollouts.

arxiv arXiv cs.CL · 10h ago

RAPS-DA: Regime-Aware Peer Specialization for Robust RAG

The authors propose RAPS-DA, a regime-aware peer specialization framework designed to address the fragility of retrieval-augmented generation (RAG) when retrieved context conflicts with a model's parametric knowledge. This approach disentangles incompatible learning signals across different reliability regimes by training specialized peers and applying targeted supervision.

arxiv arXiv cs.CL · 11h ago

Permutation-Invariant Embedding Model Fine-Tuning for Structured Metadata Retrieval

The article demonstrates that field order significantly impacts retrieval quality in structured metadata systems because standard fine-tuning causes encoders to rely on absolute position rather than field labels. To address this, the authors propose Permutation-Invariant Fine-Tuning (PI-FT), a method that serializes records under randomly sampled field orders with dropout to bind meaning to labels.

arxiv arXiv cs.CL · 11h ago

Situation Perception: A Necessary Primitive to Artificial Superintelligence

The article argues that current large language models lack a critical capacity called "situation perception," which is essential for achieving artificial superintelligence. This missing ability involves constructing and acting within internal simulations of possible worlds across latent time.

arxiv arXiv cs.CL · 11h ago

SIMAX: A Scalable and Interpretable Framework for Multi-Fidelity and Annotated Clinician-Patient Dialogue Simulation

Researchers developed SIMAX, a framework designed to generate controlled clinical dialogue data with reference behavioral annotations to address the scarcity of scalable evaluation data for AI-driven communication coding systems. The system creates simulated clinician-patient interactions from predefined scenarios, personas, and voice conditions, utilizing specific codebooks to control overall communication quality and countable behaviors.

arxiv arXiv cs.CL · 11h ago

TRACE: Temporal Relationship-Aware Conversational Entrainment Detection in Dyadic Speech

Researchers introduce DyadEE, a dataset for detecting emotional entrainment in dyadic speech, and propose TRACE, a window-level framework that models these interactions as ordered sequences of acoustic embeddings. The study demonstrates that incorporating conversational context and relationship information significantly improves detection accuracy.

arxiv arXiv cs.CL · 11h ago

Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?

This paper introduces Poller (Poetry LLM Evaluator), a novel method that leverages large language models to evaluate poetry understanding by emulating human judgment through role-playing. The approach requires LLMs to adopt the perspective of the poem's author, using detailed information to bridge the gap between automated efficiency and human expertise.

arxiv arXiv cs.CL · 11h ago

FlashMorph: Budget-Constrained Hybrid Layer Selection for Efficient Transformers

FlashMorph is a novel method for converting Transformer models into hybrid architectures that balance full-attention accuracy with linear-attention efficiency by optimizing layer selection as a budget-constrained subset problem. The approach constructs a morphable model with parallel attention branches and jointly optimizes layerwise gates on synthetic data to determine the optimal configuration.

arxiv arXiv cs.CL · 12h ago

Attractor States Emerge in Multi-Turn LLM Conversations

A study investigates whether open-ended large language model discussions exhibit attractor-like behavior by analyzing trajectories across seven models and twenty controversial topics. The research compares self-play and mixed-play dyadic debates to understand how conversations settle into stable sets of behaviors.

arxiv arXiv cs.CL · 12h ago

Uncertainty-Aware Generation and Decision-Making Under Ambiguity

This study evaluates uncertainty-aware decision-making algorithms based on Bayesian decision theory and risk-averse approaches for LLM tasks like tutoring and peer reviewing. The authors use conformal prediction to provide guarantees over strategies and scores, finding that these methods can improve generation utility but require careful implementation under high ambiguity.

arxiv arXiv cs.CL · 12h ago

Scaling the Horizon, Not the Parameters: Reaching Trillion-Parameter Performance with a 35B Agent

Researchers introduce Agents-A1, a 35B Mixture-of-Experts model that achieves performance comparable to trillion-parameter models by scaling the agent horizon rather than parameter count. The approach focuses on extending long-horizon trajectories and unifying heterogeneous agent abilities through a specialized training infrastructure.

arxiv arXiv cs.CL · 12h ago

Self-Evolving World Models for LLM Agent Planning

The paper introduces WorldEvolver, a framework that equips long-horizon LLM agents with reliable foresight by revising deployment-time context without modifying model parameters. It addresses the issue of unreliable predictions degrading decision-making through a self-evolving approach that enhances predictive fidelity and planning performance.

media r/LocalLLaMA · 13h ago

How I'm using local models for real-world coding

The author shares a practical setup for using local large language models on modest hardware, specifically a laptop with 32GB of RAM and an NVIDIA RTX 4070 with 8GB VRAM. The core strategy involves running the Qwen3.6-35B-A3B model locally as a 'small coding agent' while offloading complex planning to a cloud-based GLM 5.2 instance.

arxiv arXiv cs.CL · 13h ago

A Diagnostic Framework and Multi-Evaluator Audit of Evaluator-Driven Preference Dynamics in Self-Adapting LLM Agents

The article documents how measurements from proprietary LLM evaluators can become invalid within weeks, introducing the EPC framework to detect such instability. It applies this diagnostic across eight experimental conditions, revealing that version-conditional instability makes single-snapshot evaluator studies unreliable.

arxiv arXiv cs.CL · 13h ago

The Hidden Cost of Resampling: How Imbalance Correction Degrades Probability Calibration in Tree Ensembles

This study evaluates the impact of resampling methods like SMOTE and random undersampling on probability calibration in tree ensembles, finding that while SMOTE's cost is small, undersampling severely degrades calibration.

arxiv arXiv cs.CL · 13h ago

How Far Do On-Prem Open LLMs Get on Text-to-SQL? A Cross-Family Size x Technique Frontier on BIRD

This study evaluates the performance of open-weight large language models running on-premises for text-to-SQL tasks using a reproducible benchmark on the BIRD development split. It compares three model families across two generations while ablating specific accuracy-enhancing techniques to determine their actual value.