All articles — korshunov.ai

Improving Verbalized Uncertainty Calibration in Medical VQA

This work addresses the tendency of multimodal large language models to produce overconfident outputs in Medical Visual Question Answering by proposing a training-based framework that finetunes these models for better calibration. The method employs a composite loss function combining Brier-style calibration, anchor regularization, contrastive image-text alignment, and KL divergence terms to align model confidence with actual correctness.
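
As a rough illustration of what a Brier-style calibration term measures, the sketch below computes the standard Brier score: the mean squared gap between a stated confidence and whether the answer was actually correct. The summary does not give the paper's exact loss, so the function name `brier_score` and its inputs are assumptions for illustration only, and the other three terms of the composite loss are not modeled here.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared error between stated confidence and 0/1 correctness.

    Hypothetical helper: this is the textbook Brier score, not the paper's
    composite loss. Lower is better; 0.0 means perfectly calibrated and
    perfectly confident on every example.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one example is required")
    # Each term penalizes the distance between confidence and the 0/1 outcome.
    squared_gaps = [(c - float(y)) ** 2 for c, y in zip(confidences, correct)]
    return sum(squared_gaps) / len(squared_gaps)
```

For example, a model that reports 0.9 confidence on one correct and one incorrect answer is penalized mostly for the incorrect one, which is the overconfidence the summary describes.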

arxiv arXiv cs.CL · 5h ago

Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization

Researchers propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into Interaction Perception, Psychological Empathy, and Logical Construction to improve character fidelity. To address gradient misalignment in reinforcement learning, they introduce Role-Aware Policy Optimization (RAPO), which uses profile-token mutual information to weight gradients asymmetrically.

arxiv arXiv cs.CL · 5h ago

NuclearQAv2: A Structured Benchmark for Evaluating Domain-Science Competence in Large Language Models

Researchers introduce NuclearQAv2, a new benchmark designed to assess the reliability of large language models in nuclear engineering by testing factual knowledge, quantitative reasoning, and conceptual understanding.

arxiv arXiv cs.CL · 6h ago

Towards Explainable Adjudicative Variance: Quantifying Judicial Discretion via Gated Multi-Task Learning

Researchers propose a Judge-Aware Gated Multi-Task Learning architecture that disentangles objective case facts from adjudicative context to improve legal outcome prediction. The model uses a fine-grained outcome taxonomy and a gated fusion mechanism to dynamically modulate reliance on judge identity, evaluated on 13,937 UK Employment Tribunal decisions.

arxiv arXiv cs.CL · 6h ago

The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans

A study introduces the "riddle riddle" paradigm to determine whether large language models (LLMs) rely on flexible reasoning or pattern matching, revealing that humans and LLMs fail in opposite directions. In experiments involving nine state-of-the-art LLMs and 100 human participants, LLMs performed significantly worse on riddle riddles than on genuine riddles, while humans showed the reverse trend.

arxiv arXiv cs.CL · 6h ago

HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models

Researchers introduce HarmVideoBench, a multi-layered diagnostic benchmark designed to evaluate large vision-language models on their ability to understand harmful videos beyond superficial cues. The benchmark addresses limitations of existing work by incorporating explanatory rationales and assessing three hierarchical dimensions of harm: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning.

arxiv arXiv cs.CL · 6h ago

Forecasting With LLMs: Improved Generalization Through Feature Steering

This study applies large language models to forecasting tasks and uses sparse autoencoders to analyze their internal states, distinguishing between time-specific knowledge and generalizable patterns. The research identifies specific features associated with both time-aware reasoning and look-ahead-biased reasoning.

arxiv arXiv cs.CL · 6h ago

Syntactic Belief Update as the Driver of Garden Path Processing Difficulty

The article proposes Syntactic Belief Update, a model that predicts processing difficulty in garden path sentences by measuring the magnitude of syntactic belief updates via generalized Rényi divergence. This approach outperforms lexical surprisal by providing a better fit to human reading time data.
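
For readers unfamiliar with the quantity involved, the sketch below computes the standard Rényi divergence of order alpha between two discrete distributions, D_alpha(P || Q) = log(sum_i p_i^alpha * q_i^(1 - alpha)) / (alpha - 1). The summary does not specify the paper's generalized variant, so this is the textbook form only. Treating `p` as the belief over parses after a word and `q` as the belief before it is an assumption, as is the function name `renyi_divergence`.

```python
import math
from typing import Sequence


def renyi_divergence(p: Sequence[float], q: Sequence[float], alpha: float) -> float:
    """Standard Renyi divergence D_alpha(P || Q) in nats, for alpha > 0, alpha != 1.

    Hypothetical helper: p and q are assumed to be probability vectors of
    equal length. The paper's "generalized" divergence may differ from this.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0:
            if alpha > 1:
                return math.inf  # P puts mass where Q has none
            continue  # for alpha < 1 the term q_i^(1 - alpha) is zero
        total += (pi ** alpha) * (qi ** (1 - alpha))
    if total == 0:
        return math.inf  # disjoint supports with alpha < 1
    return math.log(total) / (alpha - 1)
```

Under this reading, a larger divergence corresponds to a larger revision of the parse distribution, and identical distributions give a divergence of zero.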

arxiv arXiv cs.CL · 6h ago

Paved with True Intents: Intent-Aware Training Improves LLM Safety Classification Across Training Regimes

The authors introduce AIMS, a dataset of 1,724 human-annotated difficult safety prompts paired with intent descriptions and harm labels, to evaluate intent-aware training across multiple regimes. They argue that modeling user intent as an explicit signal significantly improves the robustness of safety classifiers.

arxiv arXiv cs.CL · 6h ago

Ask, Don't Judge: Binary Questions for Interpretable LLM Evaluation and Self-Improvement

The authors propose BINEVAL, a framework that decomposes evaluation criteria into atomic binary questions to provide interpretable, multi-dimensional scores for large language models. This approach generates transparent question-level feedback and calibrated overall scores by having an LLM answer fine-grained evaluation questions independently for each output.

blog Simon Willison · 6h ago

datasette-export-database 0.3a2 fixes version pin

The datasette-export-database plugin version 0.3a2 addresses a compatibility issue caused by an overly strict dependency constraint in the previous release.

github llama.cpp · 6h ago

llama.cpp b9827 release adds CUDA 2D async copy optimization

The llama.cpp b9827 release introduces a performance optimization for CUDA by adding a cudaMemcpy2DAsync fast path to the ggml_cuda_cpy function. This change accelerates same-type, same-shape strided copies where tensors are not fully contiguous but each row is contiguous, replacing slower element-wise scalar copy kernels.

media r/LocalLLaMA · 7h ago

BatonBot: Open Source Local Kanban Workflow for AI Coding Agents

The author introduces BatonBot, an open-source local-first application designed to streamline AI coding workflows by reducing the need for constant user supervision. The tool addresses the inefficiency of sequential agent interactions by allowing users to set up tasks and track progress visually on a Kanban-style board.

media r/LocalLLaMA · 7h ago

audio.cpp: 12 audio models in one C++ runtime with up to 5x speedup

The open-source project audio.cpp provides a native C++ inference framework for audio models built on top of ggml, currently supporting 12 released model families including TTS, ASR, and voice conversion. Benchmarks on Ubuntu/CUDA demonstrate that text-to-speech performance in this runtime is up to 5x faster than the corresponding Python reference implementations.

blog Simon Willison · 7h ago

Bruce Schneier on AI Liability and German Ruling

Bruce Schneier discusses a recent German ruling that holds Google liable for errors in its AI overviews, arguing that AI agents should be treated as agents of the deploying organization.

media r/LocalLLaMA · 7h ago

JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup

JetSpec introduces a speculative decoding method called causal parallel tree drafting that co-optimizes drafting cost and quality to reduce LLM generation latency. The approach achieves up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while maintaining lossless accuracy.

media r/LocalLLaMA · 7h ago

US Govt to individually approve who gets GPT 5.6.

A Reddit post by user /u/AtlanticHM on r/LocalLLaMA shares an image titled "US Govt to individually approve who gets GPT 5.6."

media r/LocalLLaMA · 7h ago

Resetting NVIDIA RTX 3090 Idle Power Consumption

A user reports that while driver version 595.71.05 previously allowed dual RTX 3090s to drop to 13-15W when idle, one card is now stuck at 24-30W with zero activity and fans off.

media r/LocalLLaMA · 7h ago

Prices of graphic cards are going crazy, should I buy a second card though?

A user on r/LocalLLaMA is considering adding a second GPU to their rig for local LLM inference but is deterred by the sharp increase in prices for AMD Radeon RX 7900 XTX and XT cards. The poster notes that while new RX 7900 XTX prices have risen to 1200€, used units are around 900€, and the budget-friendly RX 7900 XT starts at 700€.

media r/LocalLLaMA · 7h ago

Handling per-agent isolation and environment lifecycle in an orchestration library

The author details the architecture of a harness-agnostic orchestration library, focusing on managing agent environments through distinct workspace and runtime abstractions. The system defines four sequential states—unprovisioned, provisioned, started, and retired—to control the lifecycle of each agent instance.
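
The four-state lifecycle described above can be sketched as a small state machine. This is a minimal illustration, not the library's API: the names `AgentState`, `AgentLifecycle`, and `advance` are hypothetical, and it assumes that transitions only move forward one step at a time, which the summary implies by calling the states sequential but does not confirm.

```python
from enum import Enum


class AgentState(Enum):
    """The four lifecycle states named in the summary (class name is hypothetical)."""
    UNPROVISIONED = "unprovisioned"
    PROVISIONED = "provisioned"
    STARTED = "started"
    RETIRED = "retired"


# Assumed ordering: each state may only advance to the next one in this list.
_ORDER = [
    AgentState.UNPROVISIONED,
    AgentState.PROVISIONED,
    AgentState.STARTED,
    AgentState.RETIRED,
]


class AgentLifecycle:
    """Tracks one agent instance through its environment lifecycle."""

    def __init__(self) -> None:
        self.state = AgentState.UNPROVISIONED

    def advance(self) -> AgentState:
        """Move to the next state; a retired agent cannot advance further."""
        index = _ORDER.index(self.state)
        if index == len(_ORDER) - 1:
            raise RuntimeError("agent is already retired")
        self.state = _ORDER[index + 1]
        return self.state
```

Calling `advance()` three times on a fresh `AgentLifecycle` walks it from unprovisioned to retired, and a fourth call raises `RuntimeError`. A real orchestration library would presumably attach workspace and runtime setup or teardown to each transition.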