Reasoning models — korshunov.ai

Are Multilingual Models Actually Improving? Isolating True Cross-Lingual Transfer

A new metric, Hardness Adjusted Transfer (HAT) Score, isolates true cross-lingual transfer by separating it from source language accuracy gains. Analysis of 20 language models shows transfer in small models is not broken, progress with model size is slower than expected, and clear improvements have occurred over time.

arXiv cs.CL · 2d ago

Can LLMs Control Readability in Arabic?

A multi-dimensional evaluation framework assesses CEFR-controlled Arabic text generation by LLMs. Results show that CEFR-guided prompting with lexical constraints achieves high alignment with linguistic profiles and predicted readability, while unconstrained prompting shows weak control.

arXiv cs.CL · 2d ago

Benchmarking LLMs for Japanese Grapheme-to-Phoneme Conversion

A study evaluates over 30 large language models on Japanese grapheme-to-phoneme conversion using 3000 manually annotated sentences. The best LLMs achieve a kana character error rate below 0.52%, outperforming the best conventional tool (1.03%). Parse mode, with rule-based post-processing, performs better than direct mode for most models, and LLM-predicted kana improves TTS pronunciation when fed into kana-input TTS.

arXiv cs.CL · 2d ago

Nous: A Predictive World Model for Long-Term Agent Memory

Nous introduces a memory architecture based on prediction rather than storage, using categorical probability distributions to model world knowledge. Evaluated on LoCoMo with GPT-4o-mini, it achieves F1 scores of 63.50 (single-hop), 55.32 (multi-hop), 58.57 (temporal), and 62.50 (open-domain), outperforming A-MEM in three categories and BeliefMem in all four, though differences in evaluation setup limit full comparability.

Can Reasoning Models Detect Changes to Their Chains of Thought?

TSCognition and TSAlign Advance Time Series Reasoning with LLMs

Score Granularity Gap in LLM Confidence Scoring

Adaptive Data Scheduling Improves LLM Reinforcement Learning

Curiosity as Linguistic Intervention in LLM Tutoring

ORBIT: Training-Free Multi-Attribute Behavioral Steering

A Taxonomy of Conceptual Alignment in Human-Robot Dialogue

First-Token Broadcasters in Transformers: Mechanistic Origins of Language Identity

P4IR Framework Improves LLM-Based Code Compliance Accuracy

Knowledge-Graph Grounding Helps LLMs Only for Out-of-Training Knowledge

LLMs Use Difference-Making Logic to Learn Causal Structure

Character Variety in LLM-Generated Stories

Speech-Text Models Latently Transcribe Speech in Intermediate Layers

FACTOR Enables Adaptive Verification for Factuality in Long-Form Generation

Buddy System: Rust entropy monitor with NER-gated uncertainty for tiered LLM inference

VADAOrchestra: Neurosymbolic Orchestration of Adaptive Reasoning Workflows