All articles — korshunov.ai

An Empirical Analysis of Factual Errors in Human-Written Text and its Application

This study addresses the neglect of factual error detection in human-written text by distilling a taxonomy of errors from newspaper article corrections, revealing categories like kanji misconversions that are absent in current hallucination benchmarks. The authors evaluate vanilla large language models on synthesized test cases and real corrections to assess their performance on this specific task.

arxiv arXiv cs.CL · 9h ago

Multi-Stage Explainable Framework for Speech-Based Cognitive Impairment Detection

Researchers propose a multi-stage explainability framework that translates black-box transformer predictions into clinically grounded narratives for speech-based cognitive impairment detection. The system integrates SHAP-based token attribution, linguistic features, and an LLM reasoning pipeline to map model outputs to specific cognitive-linguistic dimensions.

arxiv arXiv cs.CL · 9h ago

ToxiREX: A Dataset on Toxic REasoning in ConteXt

Researchers introduce ToxiREX, a new multilingual dataset designed to capture and explain implicit, context-dependent toxicity within Reddit comment threads. The dataset utilizes a systematic toxic reasoning schema to provide structured annotations for comments related to major global events across six languages.

arxiv arXiv cs.CL · 9h ago

Dialogue to Detection: A Multimodal Hybrid NLP Pipeline for Insurance Fraud Detection

This article introduces a synthetic multimodal framework designed to replicate First Notice of Loss (FNOL) conditions for insurance fraud detection, addressing the limitations of existing text-only approaches. The system generates agent-customer dialogue transcripts and two-speaker audio to integrate linguistic, behavioral, and speaker-based indicators.

arxiv arXiv cs.CL · 9h ago

The Signal-Coverage Matrix: Stratifying Type and Semantic Errors in Statement Autoformalization

This article introduces a signal-coverage matrix to stratify type and semantic errors in LLM autoformalization, moving beyond scalar type-correctness metrics. The framework crosses Lean elaborator results with semantic equivalence judgments to sort each output into one of four cells: true success, type-only, semantic-only, or both-fail.
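The four-cell stratification can be illustrated with a minimal sketch. The names `classify`, `signal_coverage_matrix`, `type_checks`, and `semantically_equivalent` are hypothetical and not taken from the paper; the two booleans stand in for the Lean elaborator result and the semantic equivalence judgment. The sketch also assumes that "type-only" means only the type check passes and "semantic-only" means only the semantic judgment passes; the paper may define the labels differently.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify(type_checks: bool, semantically_equivalent: bool) -> str:
    """Map one output to a cell by crossing the two binary signals.

    Assumption: "type-only" = passes only the type check,
    "semantic-only" = passes only the semantic judgment.
    """
    if type_checks and semantically_equivalent:
        return "true success"
    if type_checks:
        return "type-only"
    if semantically_equivalent:
        return "semantic-only"
    return "both fail"


def signal_coverage_matrix(results: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count how many outputs land in each of the four cells."""
    return Counter(classify(t, s) for t, s in results)


# Made-up (type_checks, semantically_equivalent) pairs, for illustration only.
outputs = [(True, True), (True, False), (False, True), (False, False), (True, True)]
matrix = signal_coverage_matrix(outputs)
```

A scalar type-correctness score would report only the fraction of outputs that type-check; the per-cell counts keep the two failure modes separate.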

arxiv arXiv cs.CL · 9h ago

Tree-of-Thoughts Hybrid Approach for Legal Case Judgement Summarization

This study proposes a tree-of-thoughts-inspired extractive-abstractive summarization approach for legal case judgements, addressing the limited exploration of hybrid techniques in prior work. Experiments with DeepSeek and LLaMA models indicate that the proposed method yields better summaries than traditional extractive or abstractive prompts.

arxiv arXiv cs.CL · 9h ago

DG^VoiC: Speaker Clustering for Fraud Investigation under Real Call-Centre Conditions

This paper introduces DG^VoiC, a voice clustering framework designed to identify repeated speakers in anonymized real call-center audio to assist in fraud investigation. The method combines sensitive information-aligned anonymization, speech-focused preprocessing, sliding-window speaker embedding extraction, and cosine similarity-based clustering.

arxiv arXiv cs.CL · 9h ago

LLMs Judge Worse Than They Generate in In-Context QA

A study challenges the assumption that large language models evaluate their own outputs better than they generate them, finding that generation accuracy exceeds self-evaluation on three of four tested benchmarks. The research utilizes a controlled in-context QA setting to isolate evaluation performance from parametric knowledge confounds.

arxiv arXiv cs.CL · 9h ago

MultiHashFormer: Hash-based Generative Language Models

The paper introduces MultiHashFormer, a framework enabling hash-based autoregression in causal language models by representing tokens as unique signatures of discrete hash IDs. This approach allows the model to compress token information into latent vectors for Transformer processing while mapping them back to text, effectively addressing the many-to-one collision issues that previously prevented hashing in generative contexts.

arxiv arXiv cs.CL · 9h ago

Single and Multi Truth Data Fusion using Large Language Models

This paper investigates the use of Large Language Models (LLMs) for data fusion tasks involving tabular data, covering both single-truth and multi-truth scenarios. The study evaluates various prompting strategies across three benchmark datasets to determine their effectiveness in resolving conflicting values from multiple sources.

arxiv arXiv cs.CL · 10h ago

Scaling limit of the Random Language Model

This article develops a quantitative theory for the Random Language Model (RLM) in a scaling limit where the number of hidden symbols approaches infinity while the grammar temperature approaches zero at a fixed ratio. The study establishes that the model admits a controlled description based on a large-deviation principle over rule-usage patterns, mapping the problem to Random Energy Models with nontrivial combinatorics.

arxiv arXiv cs.CL · 10h ago

Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability

This article introduces mechanism-driven monitors designed to detect large language model training instability before it causes significant damage. By deriving internal signals from the functional roles of critical modules, these monitors identify failures thousands of steps earlier than traditional loss-based methods.

arxiv arXiv cs.CL · 10h ago

From Tokens to States: LLMs as a Special Case of World Models

The article challenges the dichotomy between large language models and world models by arguing that LLMs are actually a degenerate special case of world models rather than a replacement. It posits that there is a continuous spectrum from next-token prediction to latent-space architectures, with current research already occupying intermediate positions.

arxiv arXiv cs.CL · 10h ago

Epi2Diff: Using LLM Reasoning Traces to Predict Human Item Difficulty

Researchers introduce Epi2Diff, a framework that maps Large Reasoning Model (LRM) traces into cognitively grounded episode sequences to predict human item difficulty in educational assessment. By modeling difficulty through reasoning scale, effort allocation, and state transitions, the method provides an interpretable alternative to costly human calibration.

arxiv arXiv cs.CL · 10h ago

HPRO: Hierarchical Progressive Reward Optimization for Emotional TTS

The authors propose HPRO, a hierarchical progressive reward optimization framework designed to enhance emotional expressiveness in LLM-based Text-to-Speech models while preserving linguistic intelligibility. This approach addresses structural mismatches in existing preference-driven methods by isolating content and emotion and bridging the gap between sparse rewards and dense generation.

arxiv arXiv cs.CL · 10h ago

Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models

This study investigates how vision-language models resolve conflicts between visual evidence and memorized world knowledge by combining activation patching with mechanistic analysis across three model families. The research identifies a sparse causal circuit where visual grounding is the default, while overriding it with prior knowledge requires specific attention heads.

arxiv arXiv cs.CL · 10h ago

Google Introduces Paper Assistant Tool for Automated Scientific Review

To address the scalability challenges of traditional peer review in the era of AI-assisted science, researchers propose a taxonomy of AI-human collaboration and introduce the Paper Assistant Tool (PAT). PAT is an agentic AI framework designed to ingest full scientific manuscripts and produce comprehensive evaluations by checking theoretical results, validating experiments, and identifying potential flaws.

media r/LocalLLaMA · 10h ago

Running Llama 3.1 405B on a Single 8xA100 Node with Hot-Loaded LoRA Adapters

A user demonstrates successfully running the Llama 3.1 405B model quantized to AWQ-INT4 on a single node equipped with eight A100 80GB GPUs, enabling up to 30 fine-tuned specialists to be loaded and switched in under 200ms.
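A rough back-of-the-envelope check shows why this configuration is plausible. The sketch assumes exactly 4 bits per weight and decimal gigabytes, and ignores quantization scales and zero-points, the KV cache, activations, and the adapters themselves, so the real footprint is higher than the figure computed here.

```python
# Figures from the post: 405B parameters, INT4 weights, eight 80 GB GPUs.
PARAMS = 405e9
BITS_PER_WEIGHT = 4      # assumption: ignores AWQ scales and zero-points
NUM_GPUS = 8
GPU_MEMORY_GB = 80

# Weight memory in decimal gigabytes (1 GB = 1e9 bytes).
weights_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9

# Aggregate memory across the node and the even per-GPU share of the weights.
total_gb = NUM_GPUS * GPU_MEMORY_GB
per_gpu_gb = weights_gb / NUM_GPUS

# What remains for the KV cache, activations, and LoRA adapters.
headroom_gb = total_gb - weights_gb

print(f"weights: {weights_gb:.1f} GB, per GPU: {per_gpu_gb:.1f} GB, "
      f"headroom: {headroom_gb:.1f} GB of {total_gb} GB")
```

Under these assumptions the weights take roughly a third of the node's memory, which leaves room for multiple adapters alongside the KV cache.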

media r/LocalLLaMA · 10h ago

Ubuntu, CUDA, llama.cpp, nvcc versioning

A user shares their experience resolving CUDA toolkit versioning issues on Ubuntu in order to target the compute capabilities of newer GPUs, such as the Blackwell-architecture RTX 5060 Ti. The post highlights that the default apt repository provides outdated CUDA versions, necessitating manual installation of the Debian package from NVIDIA's website.

arxiv arXiv cs.LG · 11h ago

Simulation-Free Estimation of Traffic Flows from Sparse Count Data

The authors propose a method for estimating time-varying traffic flow patterns from sparse aggregated vehicle counts by partitioning the study area and solving a weighted least-squares optimization problem. This approach uses a weighted contribution matrix to encode sensor coverage, steering the optimizer toward flow configurations that are directly observable.