All articles — korshunov.ai

LLMs Judge Worse Than They Generate in In-Context QA

A study challenges the assumption that large language models evaluate their own outputs better than they generate them, finding that generation accuracy exceeds self-evaluation accuracy on three of four tested benchmarks. The research uses a controlled in-context QA setting to isolate evaluation performance from confounds introduced by parametric knowledge.

arxiv arXiv cs.CL · 5h ago

MultiHashFormer: Hash-based Generative Language Models

The paper introduces MultiHashFormer, a framework enabling hash-based autoregression in causal language models by representing tokens as unique signatures of discrete hash IDs. This approach allows the model to compress token information into latent vectors for Transformer processing while mapping them back to text, effectively addressing the many-to-one collision issues that previously prevented hashing in generative contexts.

arxiv arXiv cs.CL · 5h ago

Single and Multi Truth Data Fusion using Large Language Models

This paper investigates the use of Large Language Models (LLMs) for data fusion tasks involving tabular data, covering both single-truth and multi-truth scenarios. The study evaluates various prompting strategies across three benchmark datasets to determine their effectiveness in resolving conflicting values from multiple sources.

arxiv arXiv cs.CL · 6h ago

Scaling limit of the Random Language Model

This article develops a quantitative theory for the Random Language Model (RLM) in a scaling limit where the number of hidden symbols approaches infinity while the grammar temperature approaches zero at a fixed ratio. The study establishes that the model admits a controlled description based on a large-deviation principle over rule-usage patterns, mapping the problem to Random Energy Models with nontrivial combinatorics.

arxiv arXiv cs.CL · 6h ago

Mechanism-Driven Monitors for Preemptive Detection of LLM Training Instability

This article introduces mechanism-driven monitors designed to detect large language model training instability before it causes significant damage. By deriving internal signals from the functional roles of critical modules, these monitors identify failures thousands of steps earlier than traditional loss-based methods.

arxiv arXiv cs.CL · 6h ago

From Tokens to States: LLMs as a Special Case of World Models

The article challenges the dichotomy between large language models and world models by arguing that LLMs are actually a degenerate special case of world models rather than a replacement. It posits that there is a continuous spectrum from next-token prediction to latent-space architectures, with current research already occupying intermediate positions.

arxiv arXiv cs.CL · 6h ago

Epi2Diff: Using LLM Reasoning Traces to Predict Human Item Difficulty

Researchers introduce Epi2Diff, a framework that maps Large Reasoning Model (LRM) traces into cognitively grounded episode sequences to predict human item difficulty in educational assessment. By modeling difficulty through reasoning scale, effort allocation, and state transitions, the method provides an interpretable alternative to costly human calibration.

arxiv arXiv cs.CL · 6h ago

HPRO: Hierarchical Progressive Reward Optimization for Emotional TTS

The authors propose HPRO, a hierarchical progressive reward optimization framework designed to enhance emotional expressiveness in LLM-based Text-to-Speech models while preserving linguistic intelligibility. This approach addresses structural mismatches in existing preference-driven methods by isolating content and emotion and bridging the gap between sparse rewards and dense generation.

arxiv arXiv cs.CL · 6h ago

Vision-Default, Prior-Override: Causal Mechanisms of Perception-Knowledge Conflict in Vision-Language Models

This study investigates how vision-language models resolve conflicts between visual evidence and memorized world knowledge by combining activation patching with mechanistic analysis across three model families. The research identifies a sparse causal circuit where visual grounding is the default, while overriding it with prior knowledge requires specific attention heads.

arxiv arXiv cs.CL · 6h ago

Google Introduces Paper Assistant Tool for Automated Scientific Review

To address the scalability challenges of traditional peer review in the era of AI-assisted science, researchers propose a taxonomy of AI-human collaboration and introduce the Paper Assistant Tool (PAT). PAT is an agentic AI framework designed to ingest full scientific manuscripts and produce comprehensive evaluations by checking theoretical results, validating experiments, and identifying potential flaws.

media r/LocalLLaMA · 6h ago

Running Llama 3.1 405B on a Single 8xA100 Node with Hot-Loaded LoRA Adapters

A user demonstrates running the Llama 3.1 405B model, quantized to AWQ-INT4, on a single node with eight A100 80GB GPUs. The setup allows up to 30 fine-tuned specialists to be loaded and switched in under 200 ms.

media r/LocalLLaMA · 6h ago

Ubuntu, CUDA, llama.cpp, nvcc versioning

A user describes resolving CUDA toolkit versioning issues on Ubuntu in order to target the compute capabilities of newer Blackwell-architecture GPUs such as the RTX 5060 Ti. The post notes that the default apt repository provides outdated CUDA versions, so the Debian package from NVIDIA's website must be installed manually.

arxiv arXiv cs.LG · 7h ago

Simulation-Free Estimation of Traffic Flows from Sparse Count Data

The authors propose a method for estimating time-varying traffic flow patterns from sparse aggregated vehicle counts by partitioning the study area and solving a weighted least-squares optimization problem. This approach uses a weighted contribution matrix to encode sensor coverage, steering the optimizer toward flow configurations that are directly observable.

arxiv arXiv cs.LG · 7h ago

SQLConductor: Search-to-Policy Learning for Step-wise Text-to-SQL Orchestration

The paper introduces SQLConductor, a step-wise orchestration learning framework for Text-to-SQL that formulates subtasks as specialized actions and trains a policy model to select the next action based on intermediate artifacts and feedback.

arxiv arXiv cs.LG · 7h ago

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

VeriEvol is an iterative framework designed to scale multimodal mathematical reasoning by decoupling prompt difficulty from answer reliability during data construction. It employs a type-aware evolution module to generate harder prompts and the HTV-Agent verifier to ensure answer correctness through multi-source counter-evidence.

arxiv arXiv cs.LG · 7h ago

The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

This article introduces a framework for modeling the energy consumption of Transformer training on multiple GPUs, with the aim of addressing growing computational costs and supporting sustainable system design.

arxiv arXiv cs.LG · 7h ago

SuperCond-GNN: Scalable Graph Neural Network Surrogate for Superconducting Circuit Simulations

This paper introduces SuperCond-GNN, a graph neural network surrogate model designed to predict voltage distribution in high-temperature superconducting magnets by mapping lumped-element circuits to graph representations. The model achieves a mean MAPE of 4.3% on tape stacks and enables fast inference of current redistribution across various circuit configurations.

arxiv arXiv cs.LG · 7h ago

Approximating velocity fields with planted attractors via Neural-ODEs for classification

This work employs Neural ODEs equipped with a curated collection of equilibrium points to perform classification tasks. The planted attractors serve as indicators for target classes, while the velocity field shapes the dynamical landscape to direct inputs toward their corresponding destinations.

arxiv arXiv cs.LG · 7h ago

Scheduling Thoughts: Learning the Order of Thought in Diffusion Language Models

Researchers propose Self-Aware Scheduling (SAS), a method that learns an optimal token unmasking order for masked diffusion language models to improve generation quality. By deriving a tractable upper bound on sequential decoding mismatch, the approach casts order selection as a policy optimization problem using Group Relative Policy Optimization.

media r/LocalLLaMA · 7h ago

Minimax M3 vs M2.7

A Reddit user is requesting feedback from individuals who have updated to the Minimax M3 model from version M2.7. The post seeks community insights on the differences and performance between these two iterations.