All articles — korshunov.ai

Einstein World Models: Visualizing Counterfactuals for LLM Reasoning

The article introduces Einstein World Models (EWMs), a framework designed to enhance large language model reasoning by integrating visual-temporal rollouts into the reasoning trace. This approach allows models to utilize visual thought experiments as inspectable hypotheses to complement text-based processing.

arxiv arXiv cs.CL · 4h ago

Auditing Framing-Sensitive Behavioral Instability in LLMs for Mental Health

This study investigates how semantically similar concerns presented through different contextual framings elicit varying responses from instruction-tuned large language models, potentially challenging system reliability. Using controlled matched prompts and layer-wise probing analyses, the authors demonstrate that framing systematically alters interpretive response tendencies across multiple model architectures.

arxiv arXiv cs.CL · 4h ago

ReaORE: Reasoning-Guided Progressive Open Relation Extraction Empowered by Large Reasoning Models

Researchers propose ReaORE, a framework for open relation extraction that utilizes large reasoning models to achieve reliable generalization to unseen relation types. The method addresses limitations of current clustering and direct generation approaches through a coarse-to-fine reasoning process.

arxiv arXiv cs.CL · 4h ago

Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs

This study investigates the presence and structure of emotion vectors in open-weight large language models, specifically Apertus-8B-Instruct-2509 and Gemma-4-E4B-it. The research confirms that these models encode valence geometry with high correlation to human psychological structures, approaching the levels previously observed in Claude Sonnet 4.5.

arxiv arXiv cs.CL · 4h ago

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

The authors introduce MinGram, a minimalist unigram tokenizer that simplifies training by using a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step. This approach removes the need for suffix arrays, forward-backward passes, and iterative prune loops, making the procedure significantly less complex than standard methods.
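
As a rough illustration of what a minimum-token path is, the sketch below segments a string into the fewest pieces drawn from a fixed vocabulary using dynamic programming. The function name `min_token_path`, the toy vocabulary, and the tie-breaking rule are assumptions made for this example; the summary does not say how MinGram scores or breaks ties between paths.

```python
def min_token_path(text, vocab):
    """Return the segmentation of `text` that uses the fewest vocab tokens.

    Hypothetical helper for illustration only. Returns None when the text
    cannot be covered by the vocabulary.
    """
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i] = fewest tokens covering text[:i]
    back = [-1] * (n + 1)    # back[i] = start index of the last token on that path
    best[0] = 0
    max_len = max((len(tok) for tok in vocab), default=0)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            # Strict "<" keeps the first minimal path found (an arbitrary tie-break).
            if best[start] + 1 < best[end] and text[start:end] in vocab:
                best[end] = best[start] + 1
                back[end] = start

    if best[n] == inf:
        return None

    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


toy_vocab = {"un", "break", "able", "unbreak", "a", "b", "e", "k", "l", "n", "r", "u"}
print(min_token_path("unbreakable", toy_vocab))
```

In a Hard EM loop, counts would be collected only from this single best path for each word, rather than from all segmentations weighted by probability.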

arxiv arXiv cs.CL · 4h ago

Improving Verbalized Uncertainty Calibration in Medical VQA

This work addresses the tendency of multimodal large language models to produce overconfident outputs in Medical Visual Question Answering by proposing a training-based framework that finetunes these models for better calibration. The method employs a composite loss function combining Brier-style calibration, anchor regularization, contrastive image-text alignment, and KL divergence terms to align model confidence with actual correctness.
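
For readers unfamiliar with the Brier-style term, here is a minimal sketch of the standard Brier score applied to verbalized confidences. The function name and sample values are assumptions for illustration; the summary does not give the exact form of the paper's term, nor how it is weighted against the anchor, contrastive, and KL terms.

```python
def brier_score(confidences, correct):
    """Mean squared gap between stated confidence (0..1) and correctness (0 or 1).

    Lower is better; a model that always says 1.0 and is always right scores 0.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    squared_gaps = [(c - float(y)) ** 2 for c, y in zip(confidences, correct)]
    return sum(squared_gaps) / len(squared_gaps)


# Hypothetical example: two answers, the second one confidently wrong.
print(brier_score([0.9, 0.6], [1, 0]))
```

An overconfident wrong answer (high confidence, correctness 0) contributes a large squared gap, which is why minimizing this quantity pushes stated confidence toward actual accuracy.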

arxiv arXiv cs.CL · 4h ago

Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization

Researchers propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into Interaction Perception, Psychological Empathy, and Logical Construction to improve character fidelity. To address gradient misalignment in reinforcement learning, they introduce Role-Aware Policy Optimization (RAPO), which uses profile-token mutual information to weight gradients asymmetrically.

arxiv arXiv cs.CL · 4h ago

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

Researchers introduce NuclearQAv2, a new benchmark designed to assess the reliability of large language models in nuclear engineering by testing factual knowledge, quantitative reasoning, and conceptual understanding.

arxiv arXiv cs.CL · 5h ago

Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning

Researchers propose a Judge-Aware Gated Multi-Task Learning architecture that disentangles objective case facts from adjudicative context to improve legal outcome prediction. The model uses a fine-grained outcome taxonomy and a gated fusion mechanism to dynamically modulate reliance on judge identity, evaluated on 13,937 UK Employment Tribunal decisions.

arxiv arXiv cs.CL · 5h ago

The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans

A study introduces the "riddle riddle" paradigm to determine whether large language models (LLMs) rely on flexible reasoning or pattern matching, revealing that humans and LLMs fail in opposite directions. In experiments involving nine state-of-the-art LLMs and 100 human participants, LLMs performed significantly worse on riddle riddles than on genuine riddles, while humans showed the reverse trend.

arxiv arXiv cs.CL · 5h ago

HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models

Researchers introduce HarmVideoBench, a multi-layered diagnostic benchmark designed to evaluate large vision-language models on their ability to understand harmful videos beyond superficial cues. The benchmark addresses limitations in existing works by incorporating explanatory rationales and assessing three hierarchical dimensions of harm: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning.

arxiv arXiv cs.CL · 5h ago

Forecasting With LLMs: Improved Generalization Through Feature Steering

This study applies Large Language Models to forecasting tasks and uses sparse autoencoders to analyze their internal states, distinguishing between time-specific knowledge and generalizable patterns. The research identifies specific features associated with both time-aware reasoning and look-ahead-biased reasoning.

arxiv arXiv cs.CL · 5h ago

Syntactic Belief Update as the Driver of Garden Path Processing Difficulty

The article proposes Syntactic Belief Update, a model that predicts processing difficulty in garden path sentences by measuring the magnitude of syntactic belief updates via generalized Rényi divergence. This approach outperforms lexical surprisal by providing a better fit to human reading time data.
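
To make the belief-update measure concrete, the sketch below computes the standard Rényi divergence of order alpha between two discrete distributions over candidate parses. This is the textbook definition, not necessarily the generalized form used in the paper; the function name, the example distributions, and the choice to measure the posterior against the prior are all assumptions.

```python
import math


def renyi_divergence(p, q, alpha):
    """Renyi divergence D_alpha(P || Q) in nats for discrete distributions.

    `p` and `q` are equal-length lists of probabilities. alpha == 1 is
    handled as the Kullback-Leibler limit.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if any(qi <= 0 for qi in q):
        raise ValueError("this sketch requires q to be strictly positive")
    if alpha == 1:
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    total = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1)


# Hypothetical beliefs over two parses before and after the disambiguating word.
prior = [0.9, 0.1]
posterior = [0.2, 0.8]
print(renyi_divergence(posterior, prior, alpha=2))
```

A large value indicates that the disambiguating word forced a large shift in belief across parses, which is the kind of quantity the summary links to reading-time slowdowns.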

arxiv arXiv cs.CL · 5h ago

Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes

The authors introduce AIMS, a dataset of 1,724 human-annotated difficult safety prompts paired with intent descriptions and harm labels, to evaluate intent-aware training across multiple regimes. They argue that modeling user intent as an explicit signal significantly improves the robustness of safety classifiers.

arxiv arXiv cs.CL · 5h ago

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

The authors propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions to provide interpretable, multi-dimensional scores for large language models. This approach generates transparent question-level feedback and calibrated overall scores by having an LLM answer fine-grained evaluation questions independently for each output.

blog Simon Willison · 5h ago

datasette-export-database 0.3a2 fixes version pin

The datasette-export-database plugin version 0.3a2 addresses a compatibility issue caused by an overly strict dependency constraint in the previous release.

github llama.cpp · 5h ago

llama.cpp b9827 release adds CUDA 2D async copy optimization

The llama.cpp b9827 release introduces a performance optimization for CUDA by adding a cudaMemcpy2DAsync fast path to the ggml_cuda_cpy function. This change accelerates same-type, same-shape strided copies where tensors are not fully contiguous but each row is contiguous, replacing slower element-wise scalar copy kernels.
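
The copy pattern described here, where each row is contiguous but rows are separated by a stride, can be sketched on the host in plain Python. This is only a model of pitched 2D copy semantics, not the llama.cpp code; the function `copy_2d` is hypothetical, and its parameter names follow the destination pitch, source pitch, width, and height arguments of `cudaMemcpy2DAsync`.

```python
def copy_2d(dst, dpitch, src, spitch, width, height):
    """Copy `height` rows of `width` bytes between pitched buffers.

    Row r starts at byte r * spitch in `src` and at byte r * dpitch in `dst`.
    Bytes between `width` and the pitch are padding and are left untouched.
    """
    if width > spitch or width > dpitch:
        raise ValueError("width must not exceed either pitch")
    if height > 0:
        if (height - 1) * spitch + width > len(src):
            raise ValueError("source buffer too small")
        if (height - 1) * dpitch + width > len(dst):
            raise ValueError("destination buffer too small")
    for row in range(height):
        s = row * spitch
        d = row * dpitch
        dst[d:d + width] = src[s:s + width]  # one contiguous copy per row


# Hypothetical layout: 3 rows of 3 useful bytes, stored with a 4-byte source pitch.
src = bytearray(range(12))
dst = bytearray(9)
copy_2d(dst, 3, src, 4, 3, 3)
print(list(dst))
```

Each row moves as a single contiguous chunk, which is what lets one 2D copy call stand in for a kernel that copies element by element.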

media r/LocalLLaMA · 6h ago

BatonBot: Open Source Local Kanban Workflow for AI Coding Agents

The author introduces BatonBot, an open-source local-first application designed to streamline AI coding workflows by reducing the need for constant user supervision. The tool addresses the inefficiency of sequential agent interactions by allowing users to set up tasks and track progress visually on a Kanban-style board.

media r/LocalLLaMA · 6h ago

audio.cpp: 12 audio models in one C++ runtime with up to 5x speedup

The open-source project audio.cpp provides a native C++ inference framework for audio models built on top of ggml, currently supporting 12 released model families including TTS, ASR, and voice conversion. Benchmarks on Ubuntu/CUDA demonstrate that text-to-speech performance in this runtime is up to 5x faster than the corresponding Python reference implementations.

blog Simon Willison · 6h ago

Bruce Schneier on AI Liability and German Ruling

Bruce Schneier discusses a recent German ruling that holds Google liable for errors in its AI overviews, arguing that AI agents should be treated as agents of the deploying organization.