Reasoning models
arXiv cs.CL · 3d ago

Symmetric Q-Sorts Measure Value-Structure Alignment in LLMs

A new framework uses symmetric human-LLM Q-sorts to evaluate how large language models structurally align with moral values. By comparing rankings of 140 moral statements across 12 LLMs and a human reference sample, the study identifies cross-family heterogeneity and localized misalignments, showing that global performance scores can mask structural flaws. The results highlight the need for structural evaluations to complement traditional item-level moral benchmarks.

arXiv cs.CL · 3d ago

Benchmarking LLMs for Japanese Grapheme-to-Phoneme Conversion

A study evaluates over 30 large language models on Japanese grapheme-to-phoneme conversion using 3,000 manually annotated sentences. The best LLMs achieve a kana character error rate below 0.52%, outperforming the best conventional tool (1.03%). Parse mode, which adds rule-based post-processing, performs better than direct mode for most models, and LLM-predicted kana improves pronunciation when fed into kana-input text-to-speech (TTS) systems.
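
For readers unfamiliar with the metric, a character error rate is commonly computed as the total edit distance between predicted and reference strings divided by the total number of reference characters. The sketch below assumes that standard definition; the function names `edit_distance` and `kana_cer` and the sample sentence are illustrative and do not come from the study, whose exact scoring procedure may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def kana_cer(references, hypotheses):
    """Corpus-level CER: summed edits over summed reference length (assumed definition)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Hypothetical example: one small-kana error (ょ predicted as よ) in an 11-character reading.
refs = ["きょうはいいてんきです"]
hyps = ["きようはいいてんきです"]
print(f"{kana_cer(refs, hyps):.2%}")
```

Under this definition, a CER of 0.52% corresponds to roughly one wrong kana character in every 190 reference characters.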

arXiv cs.CL · 3d ago

First-Token Broadcasters in Transformers: Mechanistic Origins of Language Identity

LIHA identifies a small set of first-token broadcaster heads in GPT-2 that persistently attend to the initial prompt token and cause language switches. Instruction tuning reorganizes these circuits, concentrating language identity in early layers, as shown in a controlled comparison between the Qwen2.5-1.5B-Base and Qwen2-1.5B-Instruct models. First-token broadcasting is script-specific: non-Latin languages are processed at layer 0, matching the pattern seen in the instruction-tuned model.

arXiv cs.CL · 3d ago

P4IR Framework Improves LLM-Based Code Compliance Accuracy

P4IR, a two-stage framework, combines supervised fine-tuning with Group Relative Policy Optimization to improve LLM-based automated code compliance systems. It reduces tree edit distance by up to 23.8% and token-level Levenshtein distance by up to 38.6%, outperforming leading LLMs such as Claude Opus, GPT-5.2, and GLM-4.7 under both zero-shot and few-shot prompting, and it reduces false positives by a statistically significant margin.

arXiv cs.CL · 3d ago

Speech-Text Models Latently Transcribe Speech in Intermediate Layers

Interleaved speech-language models undergo an implicit transcription phase in which spoken words become decodable as text tokens in intermediate layers, even though the models receive no speech recognition training. In up to 77% of cases, the spoken word appears as a top candidate text prediction, after which the model predicts a text continuation and then returns to speech. This behavior is driven by interleaved training data and initialization from a text LM, and it correlates with performance on spoken knowledge tasks.