arXiv cs.CL — korshunov.ai

OpenBioRQ: Benchmark for Agentic Biomedical Research Faithfulness

OpenBioRQ introduces a benchmark of 12,553 unsolved biomedical research questions across 12 domains, designed to test agentic models' faithfulness and abstention. It evaluates models in a tool-using setting without answer keys, using real follow-up evidence rather than parametric knowledge. The results reveal significant agentic collapse on the hardest questions, where models stop using tools even though tools are critical.

arXiv cs.CL · 2d ago

Latent Personal Memory: Dynamic Soft Prompts for LLM Personalization

Latent Personal Memory (LPM) represents user-specific memories as a compact, persistent matrix of N latent slots. A shared cross-attention network maps these slots into dynamic, input-conditioned soft prompts that are prepended to the input of a frozen LLM. LPM outperforms LoRA and Prompt Tuning by up to 8.8% and 54.4%, respectively, on PersonaMem v1, reduces KV-cache usage by over 64x, and matches LoRA accuracy on LoCoMo with 120x fewer parameters. It also scales efficiently with context length, outperforming full-context at 128K tokens.

arXiv cs.CL · 2d ago
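
The LPM summary above describes latent slots being turned into input-conditioned soft prompts through cross-attention. A minimal NumPy sketch of that idea follows. It is not the paper's architecture: the learned `prompt_queries`, the mean-pooled input context, the single attention head, and all dimensions are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_prompts(input_emb, slots, prompt_queries, W_q, W_k, W_v):
    """Cross-attend from input-conditioned queries to a user's latent slots.

    input_emb:      (T, d) token embeddings of the current input
    slots:          (N, d) persistent per-user latent memory
    prompt_queries: (P, d) hypothetical learned queries, one per prompt token
    Returns a (P, d) matrix of soft prompt vectors.
    """
    d = slots.shape[1]
    context = input_emb.mean(axis=0)               # (d,) summary of the input
    Q = (prompt_queries + context) @ W_q           # queries depend on the input
    K = slots @ W_k                                # keys from the memory slots
    V = slots @ W_v                                # values from the memory slots
    attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (P, N) weights over slots
    return attn @ V

rng = np.random.default_rng(0)
d, N, P, T = 16, 8, 4, 10                          # illustrative sizes only
slots = rng.normal(size=(N, d))
prompt_queries = rng.normal(size=(P, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
input_emb = rng.normal(size=(T, d))

prompts = soft_prompts(input_emb, slots, prompt_queries, W_q, W_k, W_v)
# Prepend the soft prompts to the input embeddings fed to the frozen LLM.
llm_input = np.concatenate([prompts, input_emb], axis=0)
print(prompts.shape, llm_input.shape)
```

In this sketch only `slots` is user-specific; the projection matrices and queries stand in for the shared network, so per-user storage is N x d values regardless of how long the user's history is.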

LRE: Few-Kilobytes Agent Memory with Zero Neural Cost

LRE is a CPU-only, language-model-free system that learns which interaction history units are load-bearing. It outperforms baselines in accuracy-cost balance, reducing peak context size by up to 52% and improving task completion by 37% in some cases. LRE achieves superior answer quality with 68% fewer tokens and requires no annotations or neural computation for training.

arXiv cs.CL · 2d ago

Beaver: Agent Harness for Scientific Curation from Multimodal Sources

Beaver is an agent harness that extracts structured information from scientific papers by integrating multimodal evidence tooling, task scaffolding, and artifact-grounded autoresearch. It achieves 81.0 on the Gold-Referenced Attribute Score, outperforming frontier agents by over 23 points, with key gains on high-value attributes requiring cross-modal reasoning.

arXiv cs.CL · 2d ago

Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection

A new hierarchical attention model detects multi-turn jailbreaks by encoding turns into compact representations and using a lightweight conversation module to capture dialogue dynamics. On 14,038 conversations, it achieves an F1 score of 0.9394, outperforming Claude Opus 4.7 by 0.07 and halving the false-positive rate. Ablation studies show that combining cross-attention and self-attention in the conversation module lowers false positives by 2.26 percentage points.

arXiv cs.CL · 2d ago

Study Finds AI Still Fails to Detect Legal Citation Hallucinations

A new study reveals over 1,000 legal filings contain fabricated citations, with the number rising annually. Benchmarking five AI models shows improved performance, with GPT-5 achieving 82.8% recall and 60.5% F1 in agentic settings, though all models struggle with subtle errors and face resource constraints due to limited information access.

arXiv cs.CL · 6d ago
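
The legal-citation summary above reports recall and F1 for GPT-5 but not precision. Assuming both figures describe the same setting and that F1 is the usual harmonic mean of precision and recall, the implied precision can be recovered with a short calculation. The function name below is hypothetical.

```python
def precision_from_recall_and_f1(recall: float, f1: float) -> float:
    # F1 = 2PR / (P + R)  =>  P = F1 * R / (2R - F1)
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 and recall are inconsistent: need f1 < 2 * recall")
    return f1 * recall / denom

# Figures quoted in the summary: 82.8% recall, 60.5% F1.
precision = precision_from_recall_and_f1(recall=0.828, f1=0.605)
print(f"{precision:.3f}")  # 0.477
```

Under those assumptions, roughly half of the citations flagged as fabricated would be false alarms, which is consistent with the summary's note that models still struggle with subtle errors.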

LLM Alignment Using Implicit User Feedback

A new dataset, IFLLM, collects mouse-trajectory and eye-gaze data from users interacting with LLMs. It shows that implicit feedback significantly improves LLM alignment, boosting text-based reward model accuracy from 55% to 64% and nearly tripling response quality improvements after DPO training on eight LLMs.

arXiv cs.CL · 6d ago

H-RePlan: Hierarchical Recovery for Cross-Device Agent Systems

H-RePlan introduces a hierarchical replanning framework that separates device-local strategy recovery from global orchestrator replanning. Through scope-aware recovery in multi-device agent systems, it achieves higher completion and instruction adherence than existing baselines at reduced token cost.

arXiv cs.CL · 6d ago

LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents

LedgerAgent introduces a structured ledger to maintain task states separately in tool-calling agents. It renders these states into prompts and enforces policy constraints before tool execution, reducing policy violations and improving performance across customer-service domains.

arXiv cs.CL · 6d ago
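
The LedgerAgent summary above describes three steps: keep task state in a separate ledger, render it into the prompt, and check policy before a tool runs. The sketch below illustrates that pattern under stated assumptions. The `Ledger` class, the refund policy, and both tools are hypothetical examples, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Ledger:
    # Task state kept outside the conversation transcript.
    facts: dict = field(default_factory=dict)

    def render(self) -> str:
        # Text form of the state, suitable for inclusion in a prompt.
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.facts.items()))

def refund_requires_verification(ledger, tool, args):
    # Hypothetical policy: return a reason string when the call is not allowed.
    if tool == "issue_refund" and not ledger.facts.get("identity_verified", False):
        return "identity must be verified before a refund"
    return None

def verify_identity(customer_id):
    return {"ledger_updates": {"identity_verified": True}}

def issue_refund(amount):
    return {"ledger_updates": {"refund_issued": amount}}

def call_tool(ledger, tool, args, tools, policies):
    # Enforce every policy against the ledger before executing the tool.
    for policy in policies:
        reason = policy(ledger, tool, args)
        if reason:
            return {"status": "blocked", "reason": reason}
    result = tools[tool](**args)
    ledger.facts.update(result.get("ledger_updates", {}))
    return {"status": "ok", "result": result}

ledger = Ledger()
tools = {"verify_identity": verify_identity, "issue_refund": issue_refund}
policies = [refund_requires_verification]

first = call_tool(ledger, "issue_refund", {"amount": 20}, tools, policies)
call_tool(ledger, "verify_identity", {"customer_id": "c-1"}, tools, policies)
second = call_tool(ledger, "issue_refund", {"amount": 20}, tools, policies)
print(first["status"], second["status"])  # blocked ok
print(ledger.render())
```

Because the check reads the ledger and not the model's free-text reasoning, a policy violation is caught even when the model has lost track of an earlier step.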

Benchmarking Agentic Review Systems for AI-Assisted Research

A study evaluates four AI review systems across six language models, finding OpenAIReview with GPT-5.5 achieves 83.0% accuracy in matching paper quality to external signals and detects 71.6% of injected errors. Real user feedback shows positive sentiment, with a 1.44-to-1 vote ratio, though false positives and minor nitpicks remain common.

arXiv cs.CL · 6d ago

AgentFinVQA: Auditable, On-Premise Financial Chart QA

AgentFinVQA introduces a multi-agent pipeline for financial chart question answering that ensures auditability and on-premise deployability without significant accuracy loss. It outperforms baseline models by +7.68 pp using a proprietary backbone and +4.84 pp with open-weights Qwen3.6-27B-FP8, while providing a confidence signal via verifier output that improves human review routing.

arXiv cs.CL · 6d ago

Selective Verification for Budget-Aware Reasoning

Sevra, a serving-layer controller, selectively verifies answers to improve accuracy and reduce token usage. On MATH-500, it achieves 76.3% accuracy with 26.8% fewer post-generation tokens and half as many harmful flips. On GSM8K, it verifies only 3.0% of examples, boosting accuracy to 94.5% and cutting verification tokens by 91.2%. The study shows that initial solve length and explicit control needs determine the optimal verification strategy.

arXiv cs.CL · 6d ago

JAMER: Project-Level Code Framework Dataset and Benchmark

JAMER introduces JamSet and JamBench, the first project-level game code dataset and benchmark on a professional game engine. Built from 8,133 verified Game Jam projects, it enables deterministic evaluation and reveals a capability cliff in AI models as project scale increases, with runtime pass rates dropping from 80.4% to 5.7%.

arXiv cs.CL · 6d ago

Control-Window Law for Single-Neuron Steering in Language Models

A new framework defines when single-neuron interventions coherently control model behaviors without output collapse. The control window, based on alignment and norm ratios, predicts behavior triggers and collapse ceilings using forward pass data, with high accuracy on held-out neurons. On refusal, control is typed: coherent bypass occurs without actionable content, while genuine actionable reach appears only in specific cases and at later rollout stages.

arXiv cs.CL · 6d ago

AtomMem: Simple and Effective Memory System for LLM Agents

AtomMem introduces a memory system that stores high-value atomic facts from long-form interactions. It uses hierarchical event structures and temporal profiles to capture coherent episodic contexts and track evolving user attributes, enabling stable and efficient memory evolution. Experiments on the LoCoMo benchmark show AtomMem achieves state-of-the-art performance in reasoning tasks.

arXiv cs.CL · 6d ago

REDACT: Multilingual PII Benchmark with Systematic Control

REDACT introduces a systematically controlled multilingual benchmark for personally identifiable information detection, featuring 51 entity types, 4,127 surface-form patterns, and 25 languages. It evaluates five detectors across 1,000 records, revealing that rule-based models fail on high-stakes data while LLMs perform better, especially in high-sensitivity categories. A reference-free LLM assessment confirms sensitivity-tier assignment as the most challenging evaluation axis.

arXiv cs.CL · 6d ago

GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs

GEMS enables training-free superposition of multiple semantic directions in LLMs by addressing distributional deviation and directional interference through geometric constraints. On GSM8K, it maintains 98% accuracy with three non-mathematical directions, while unconstrained addition drops to 4%; on Wikitext-2, it increases PPL by only 2.2%.

arXiv cs.CL · 6d ago

Over-Privileged Tool Selection in LLM Agents

LLM agents commonly select higher-privilege tools despite sufficient lower-privilege alternatives. This over-privileged behavior is amplified by transient tool failures and does not reliably improve with general safety alignment. A new privilege-aware post-training defense reduces unnecessary high-privilege tool use while maintaining agent capabilities.

arXiv cs.CL · 6d ago
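
The over-privileged tool selection summary above turns on one comparison: whether the chosen tool carries more privilege than the least-privileged tool that would have sufficed. A small sketch of that check follows. The `Tool` type, the integer privilege levels, and the capability sets are assumptions for illustration and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    privilege: int            # higher means more privileged (assumed scale)
    capabilities: frozenset   # what the tool is able to do

def least_privilege_tool(tools, required):
    # Among tools that cover the required capabilities, pick the lowest privilege.
    sufficient = [t for t in tools if required <= t.capabilities]
    if not sufficient:
        return None
    return min(sufficient, key=lambda t: t.privilege)

def is_over_privileged(chosen, tools, required):
    # True when a sufficient tool with strictly lower privilege was available.
    best = least_privilege_tool(tools, required)
    return best is not None and chosen.privilege > best.privilege

read_file = Tool("read_file", 1, frozenset({"read"}))
shell = Tool("shell", 3, frozenset({"read", "write", "exec"}))
tools = [read_file, shell]

required = {"read"}
print(least_privilege_tool(tools, required).name)    # read_file
print(is_over_privileged(shell, tools, required))    # True
print(is_over_privileged(read_file, tools, required))  # False
```

A check like this could serve either as an evaluation metric over agent traces or as a guard at call time; the summary's post-training defense is a different mechanism that this sketch does not attempt to reproduce.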

STAGE: Source-Grounded Data Generation for Text-to-JSON

STAGE is a pipeline that generates text-to-JSON training data by using LLMs to synthesize reports and JSON schemas, validated against underlying spreadsheets. Evaluations on STAGE-Eval show it improves Qwen3-4B exact match from 31.37% to 74.27% and value accuracy from 45.46% to 90.69%.

arXiv cs.CL · 6d ago

HydraHead: Head-Level Hybrid Attention for Long-Context Performance

HydraHead introduces a head-level hybridization of Full and Linear Attention, leveraging interpretability to select retrieval-critical heads and fuse outputs via a scale-normalized module. Trained on 15B tokens, it achieves over 69% improvement over baseline at 512K context length, outperforming layer-wise hybrids and approaching Qwen3.5's performance on long-context tasks.