arXiv cs.CL · 3h ago

Heterogeneous Neural Predictivity from Language Models During Naturalistic Comprehension

This study demonstrates that frozen language models can serve as effective neural predictors for brain activity during natural speech and text comprehension, while distinguishing predictive utility from claims about shared neural organization. The analysis of MEG and ECoG data revealed widespread positive prediction gains over low-level baselines, though participant-level advantages were localized rather than uniform.

arXiv cs.CL · 5h ago

Auditing Framing-Sensitive Behavioral Instability in LLMs for Mental Health

This study investigates how semantically similar concerns presented through different contextual framings elicit varying responses from instruction-tuned large language models, a sensitivity that may undermine system reliability. Using controlled matched prompts and layer-wise probing analyses, the authors demonstrate that framing systematically alters interpretive response tendencies across multiple model architectures.

arXiv cs.CL · 5h ago

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

The authors introduce MinGram, a minimalist unigram tokenizer that simplifies training by using a BPE-derived seed vocabulary, Hard EM on a minimum-token path, and a single flat score-pruning step. This approach removes the need for suffix arrays, forward-backward passes, and iterative prune loops, making the procedure significantly less complex than standard methods.
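
The "minimum-token path" idea can be illustrated with a small dynamic program that segments a string into the fewest vocabulary pieces. This is only a sketch under assumptions: the function name `min_token_path`, the toy vocabulary, and the tie-breaking behavior are hypothetical and are not taken from the paper, and the Hard EM re-estimation and score-pruning steps are not shown.

```python
def min_token_path(text, vocab):
    """Segment `text` into the fewest pieces drawn from `vocab`.

    Returns a list of tokens, or None if no segmentation exists.
    Illustrative only; not the MinGram implementation.
    """
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i] = fewest tokens covering text[:i]
    back = [-1] * (n + 1)    # back[i] = start index of the last token
    best[0] = 0
    max_len = max(map(len, vocab), default=0)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            # Strict "<" keeps the first minimal path found (an assumption).
            if best[start] + 1 < best[end] and text[start:end] in vocab:
                best[end] = best[start] + 1
                back[end] = start

    if best[n] == inf:
        return None

    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    pos = n
    while pos > 0:
        start = back[pos]
        tokens.append(text[start:pos])
        pos = start
    return tokens[::-1]


# Hypothetical toy vocabulary for demonstration.
toy_vocab = {"un", "believ", "able", "unbeliev",
             "u", "n", "b", "e", "l", "i", "v", "a"}
segmentation = min_token_path("unbelievable", toy_vocab)
```

In a hard-EM setting, counts of the tokens on such a path would typically be used to re-estimate token scores; how MinGram does this exactly is not specified in the summary above.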

arXiv cs.CL · 5h ago

Improving Verbalized Uncertainty Calibration in Medical VQA

This work addresses the tendency of multimodal large language models to produce overconfident outputs in Medical Visual Question Answering by proposing a training-based framework that finetunes these models for better calibration. The method employs a composite loss function combining Brier-style calibration, anchor regularization, contrastive image-text alignment, and KL divergence terms to align model confidence with actual correctness.
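
For readers unfamiliar with the calibration term, the standard Brier score is the mean squared gap between a stated confidence and the 0/1 correctness of the answer. The sketch below shows only that textbook quantity; the function name `brier_score` and the sample values are hypothetical, and the paper's actual "Brier-style" term and its combination with the other loss components may differ.

```python
def brier_score(confidences, correct):
    """Mean squared error between confidences in [0, 1] and 0/1 outcomes.

    Lower is better; 0.0 means every confidence matched the outcome exactly.
    Illustrative only; not the paper's composite loss.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


# Hypothetical example: a model that says 0.9 on both a right and a wrong answer.
overconfident = brier_score([0.9, 0.9], [1, 0])
# A perfectly calibrated, fully certain model on the same two questions.
perfect = brier_score([1.0, 0.0], [1, 0])
```

An overconfident wrong answer contributes a large squared error, which is why minimizing a term of this kind pushes verbalized confidence toward actual correctness.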

arXiv cs.CL · 5h ago

Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization

Researchers propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into Interaction Perception, Psychological Empathy, and Logical Construction to improve character fidelity. To address gradient misalignment in reinforcement learning, they introduce Role-Aware Policy Optimization (RAPO), which uses profile-token mutual information to weight gradients asymmetrically.

arXiv cs.CL · 6h ago

The Riddle Riddle: Testing Flexible Reasoning in Large Language Models and Humans

A study introduces the "riddle riddle" paradigm to determine whether large language models (LLMs) rely on flexible reasoning or pattern matching, revealing that humans and LLMs fail in opposite directions. In experiments involving nine state-of-the-art LLMs and 100 human participants, LLMs performed significantly worse on riddle riddles than on genuine riddles, while humans showed the reverse trend.

arXiv cs.CL · 6h ago

HarmVideoBench: Benchmarking Harmful Video Understanding in Large Multimodal Models

Researchers introduce HarmVideoBench, a multi-layered diagnostic benchmark designed to evaluate large vision-language models on their ability to understand harmful videos beyond superficial cues. The benchmark addresses limitations of existing work by incorporating explanatory rationales and assessing three hierarchical dimensions of harm: Observable Evidence, Clip-Internal Meaning, and Beyond-Clip Reasoning.