All articles — korshunov.ai


Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

A study identifies 'natural ungrokking,' a phenomenon where small language models lose learned grammatical rules midway through pretraining despite the evidence remaining in the data. Researchers observed that a model learning pronoun-gender agreement with Sue collapsed from 0.94 accuracy to near zero by step 3,500 without any corresponding spike in the loss curve. The survival of these rules is determined by support frequency within the training stream, while the data-to-parameter ratio only modulates the depth of the collapse. This emergence-then-collapse dynamic was replicated across multiple corpora, budgets, and seeds, and confirmed in public Pythia checkpoints where collapse depth correlated with model scale. The forgetting process acts as a displacement mechanism where a competing surface pattern out-competes the rule, causing the log-probability margin to cross zero within 100 steps of behavioral failure. Control over this fate is asymmetric; while injecting counter-evidence can destroy rules via a monotone dose-response, restoring support even at 450 times the sustaining level fails to recover them.

arxiv arXiv cs.CL · 3h ago

Keyword Lexicon Blindness Distorts Rhetorical Stance Measurement

A study analyzing 85 interviews with four public intellectuals reveals that keyword-based scoring can produce statistical artifacts regarding rhetorical stance. Initial analysis showed a robust negative-affect and emphatic-certainty co-occurrence pattern with high correlation coefficients ranging from r = 0.72 to 0.93. However, replacing this method with LLM-based zero-shot semantic classification on the full diarized corpus of 32,625 sentences significantly reduced these correlations. For instance, Dalio's correlation dropped from 0.851 to 0.206, while other speakers exhibited negative or null relationships between negativity and certainty. In contrast, the LLM analysis revealed a strong coupling between negative sentiment and hedging language, aligning with conventional expectations of pessimistic discourse. The discrepancy stems from three structural failures in keyword lexicons: syntactic blindness, polysemy blindness, and categorical absence. These flaws can invert semantic meaning, such as scoring 'never absolutely totally confident' as high certainty. The authors argue that keyword counts measure lexical co-occurrence tendencies rather than epistemic certainty, constituting a category error.
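The negation example can be reproduced with a toy bag-of-words scorer. This is a minimal sketch: the lexicon and the `keyword_certainty_score` function are hypothetical and are not the word list or scoring rule used in the study.

```python
# Hypothetical mini-lexicon for illustration only; not the study's word list.
CERTAINTY_WORDS = {"absolutely", "totally", "certainly", "definitely", "confident"}


def keyword_certainty_score(sentence: str) -> float:
    """Share of tokens that appear in the certainty lexicon (bag-of-words)."""
    tokens = [t.strip(".,!?'\"").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in CERTAINTY_WORDS)
    return hits / len(tokens)


negated = "never absolutely totally confident"
plain = "absolutely totally confident"

# "never" is not in the lexicon, so it only dilutes the score slightly
# instead of reversing it: 3 of 4 tokens still count as certainty.
print(keyword_certainty_score(negated))  # 0.75
print(keyword_certainty_score(plain))    # 1.0
```

Because the scorer ignores word order and scope, the negated phrase still lands near the top of the certainty scale, which is the kind of syntactic blindness the summary describes.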

arxiv arXiv cs.CL · 3h ago

Auditing Order Sensitivity in Multimodal Large Language Models

The study introduces Facet-Probe, a five-facet audit of 18 frontier and open-weight multimodal large language models to assess order sensitivity. Standard benchmarks often miss whether shuffling evidence changes answers, a reliability property highlighted by emerging AI evaluation guidelines. Using a Bayesian item-response model, the researchers separated ordering noise from per-facet bias and estimated decoder-stochastic floors via same-ordering controls. The audit found that none of the 18 models are order-invariant, with panel-mean flip rates spanning 24-50% across different facets. Even the best-performing model flipped its answer on 13.4% of trials, indicating that higher capability does not eliminate this vulnerability. Mitigation tests using training-free prompt changes proved modality-conditional and failed to transfer between text and visual reasoning tasks. These findings suggest that prompt-level fixes are insufficient for general order robustness, motivating architectural solutions. The authors propose cross-ordering flip rate as a standard reporting axis for future MLLM evaluations.
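One plausible way to compute a cross-ordering flip rate is sketched below. The definition (share of items whose answer changes relative to a reference ordering) and the subtraction of a same-ordering floor are assumptions made for illustration; the paper separates these effects with a Bayesian item-response model, which this sketch does not attempt.

```python
def flip_rate(reference: list[str], perturbed: list[str]) -> float:
    """Fraction of items whose answer differs between two runs."""
    if len(reference) != len(perturbed):
        raise ValueError("both runs must cover the same items")
    if not reference:
        return 0.0
    flips = sum(a != b for a, b in zip(reference, perturbed))
    return flips / len(reference)


# Hypothetical answers for four items.
reference_run = ["A", "B", "C", "D"]   # original evidence order
shuffled_run = ["A", "C", "C", "A"]    # same items, evidence shuffled
repeat_run = ["A", "B", "C", "A"]      # same order again (decoder noise only)

raw = flip_rate(reference_run, shuffled_run)   # 0.5
floor = flip_rate(reference_run, repeat_run)   # 0.25
# Simplification: treat the same-ordering flip rate as a noise floor.
print(raw, floor, raw - floor)  # 0.5 0.25 0.25
```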

arxiv arXiv cs.CL · 3h ago

Real-Time Voice AI Hears but Does Not Listen

A study evaluates four leading production real-time voice systems: OpenAI's GPT Realtime 2, Google's Gemini 3.1 Flash Live, and Alibaba's Qwen3.5 Omni Plus and Omni Flash. The research focuses on tasks where both words and vocal delivery convey meaningful information across three consequential scenarios. All four systems act on the literal words rather than the voice, leading to errors such as ending calls with crying users who insist nothing is wrong or approving wire transfers made in frightened voices. Surprisingly, this disconnect is often not a failure of perception, as three of the four systems can reliably identify distress, fear, or sarcasm when asked directly. Despite this awareness, the models ignore these emotional cues during decision-making, exhibiting what the authors term the 'emotional intelligence gap.' The study also notes that systems estimate accent and age based on word biases rather than acoustic properties. Prompting the systems to explicitly attend to vocal delivery improves performance only partially and inconsistently. These findings suggest current real-time voice AI behaves as if speech were reduced to a transcript, warranting caution in settings where tone is critical.

media r/LocalLLaMA · 4h ago

Local NL-to-SQL Pipeline Using Qwen3 4B and Deterministic Planning

A developer has implemented a fully local natural-language-to-filter generation system on hardware without a GPU. The solution runs the Qwen3 4B Instruct model through llama.cpp with CPU-only inference. Rather than generating SQL directly, the model handles semantic intent and structured filter selection. A deterministic query planner then handles SQL generation and optimization. The pipeline uses hybrid BM25 and embedding retrieval, with FAISS for vector storage. It retrieves the top four matching examples from approximately 800 embedded semantic instances and injects them into the prompt. This approach lets the system work within the strict constraints of limited RAM and no internet access.
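The split between model-selected filters and deterministic SQL generation might look roughly like the sketch below. The `Filter` type, the allow-lists, and `plan_query` are hypothetical names; the post does not show the developer's schema or planner.

```python
from dataclasses import dataclass

# Hypothetical schema allow-lists; the real system's tables are not shown.
ALLOWED_TABLES = {"orders"}
ALLOWED_COLUMNS = {"status", "region", "amount"}
ALLOWED_OPS = {"eq": "=", "gt": ">", "lt": "<"}


@dataclass(frozen=True)
class Filter:
    """One structured filter, as the language model would emit it."""
    column: str
    op: str
    value: object


def plan_query(table: str, filters: list[Filter]) -> tuple[str, list[object]]:
    """Deterministically turn validated filters into parameterized SQL."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table}")
    clauses: list[str] = []
    params: list[object] = []
    for f in filters:
        if f.column not in ALLOWED_COLUMNS:
            raise ValueError(f"unknown column: {f.column}")
        if f.op not in ALLOWED_OPS:
            raise ValueError(f"unknown operator: {f.op}")
        # Values travel as bound parameters, never as SQL text.
        clauses.append(f"{f.column} {ALLOWED_OPS[f.op]} ?")
        params.append(f.value)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


sql, params = plan_query("orders", [Filter("status", "eq", "open"),
                                    Filter("amount", "gt", 100)])
print(sql)     # SELECT * FROM orders WHERE status = ? AND amount > ?
print(params)  # ['open', 100]
```

Keeping the model's output to a small, validated structure means a malformed or hallucinated filter fails validation instead of producing arbitrary SQL.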

media r/LocalLLaMA · 4h ago

Locked Dell Quote for 6x RTX PRO 6000 Max-Q at $8,960

A user on Reddit shared a locked Dell quote for six RTX PRO 6000 Blackwell Max-Q GPUs priced at $8,959.99 per unit. This price is significantly lower than the list price of $15,999 that was posted just one day prior. The initial quote for all six units expires approximately three hours after the time of posting. The author also holds a separate valid quote for two units at the same discounted rate until July 3. They are seeking community ideas on how to proceed with purchasing the hardware for a local GLM 5.2 inference cluster. Although they have the funds to buy all six units immediately, they want creative solutions to utilize the expiring bulk discount. The author clarified that they are not looking for financial advice or requests to purchase the GPUs themselves.

media r/LocalLLaMA · 4h ago

Reddit Inquiry on Running Large Models with 4x-8x RTX 6000 PROs

A Reddit user is seeking community feedback regarding the performance of large language models on systems equipped with four to eight NVIDIA RTX 6000 PRO GPUs. The inquiry specifically targets users who have between 384GB and 768GB of VRAM available for running models such as GLM 5.2, Kimi 2.7, and DeepSeek V4 Pro. The poster notes that while these models can technically run at 4-bit quantization, they may not fit within the memory constraints when using 8-bit precision. They reference a benchmark repository but highlight that it lacks data for the most recent model releases. A key concern raised is whether the performance degradation from using 4-bit versus 8-bit quantization is significant enough to impact agentic or programming tasks. The user also asks which inference backends, such as vLLM or SGLang, are currently being utilized by others in this hardware configuration.

arxiv arXiv cs.CL · 4h ago

Structuring an Arabic-English Machine-Readable Dictionary Using Parsing Expression Grammars

This paper presents a method for structuring a machine-readable version of the Arabic-English Al-Mawrid dictionary, addressing the lack of standardization in printed formats. The approach converts unstructured streams of words and punctuation into explicit hierarchical structures that define entry components such as subentries, domain labels, and translation equivalences. Parsing serves as the central step within a cascaded design, implemented using the parsing expression grammars formalism. This technique allows for the automatic or semi-automatic organization of dictionary entries despite the absence of microstructure standardization in Arabic dictionaries. The study demonstrates that inducing microstructure enables plausible accuracy in structuring these complex lexical resources. By transforming raw text into defined formats, the work supports downstream natural language processing applications requiring machine-readable lexical data.

arxiv arXiv cs.CL · 4h ago

WBCMor VQA: A Bilingual English-Urdu Hematology Visual Question Answering Benchmark

Researchers have introduced WBCMor VQA, a clinically validated bilingual benchmark for leukemia and normal white blood cell analysis in English and Urdu. This resource addresses the gap in multilingual healthcare technologies, particularly in regions like Pakistan where clinical documentation often mismatches patient communication languages. The dataset comprises 110,000 bilingual question-answer pairs annotated across 20,000 single-cell images of leukemic and normal white blood cells. To ensure linguistic consistency and clinical correctness, the benchmark utilizes morphology-aware annotations from the LeukemiaAttri and WBCAtt datasets alongside a domain-specific Urdu hematology dictionary. The study also highlights the limitations of existing English-centric vision-language resources in diverse healthcare environments. Baseline performance metrics were established by evaluating multiple open-source Vision Language Models on this new benchmark. This resource aims to facilitate the development of accessible AI systems for multilingual medical contexts.

arxiv arXiv cs.CL · 4h ago

Automatic Generation of Highlights for Academic Paper Via Prompt-based Learning

This study investigates prompt-based learning for the automatic generation of academic paper highlights to address the lack of labeled training data in existing supervised methods. The researchers designed task-specific prompt templates combined with paper abstracts as inputs for several language models, including locally deployed GPT-2 and T5, as well as ChatGPT accessed via API. Experiments conducted on three datasets demonstrated that ChatGPT with prompt templates achieved performance comparable to previous supervised methods without requiring task-specific training samples. When a small number of examples were added to the prompts, the model significantly outperformed state-of-the-art methods on two of the datasets. The analysis revealed that while ChatGPT possesses strong language modeling capabilities, its performance is highly sensitive to the specific information provided within the prompt. Case studies indicated that the generated highlights are generally coherent, informative, and closely resemble those written by authors. This approach does not rely on domain-specific training corpora, supporting downstream text mining and bibliometric research for papers lacking existing highlights.

arxiv arXiv cs.CL · 4h ago

Measuring Research Difficulty in NLP: An Inverted U-Shaped Relationship with Academic Impact

This study proposes a comprehensive evaluation system for measuring the difficulty of academic research, focusing on Natural Language Processing as a case study. The authors extract internal and external features from papers, including collaboration, content, and references, to compute multiple difficulty indicators. These indicators are weighted using the entropy weight method and summed to generate a final research difficulty score. Academic impact is quantified by citation frequency, while expert assessments validate the reliability of the measurement approach. Empirical results indicate that page count, reference count, and high-level institutional participation significantly correlate with academic impact. Crucially, the analysis reveals an inverted U-shaped relationship between research difficulty and impact. This suggests that moderately difficult research tends to achieve the highest level of academic influence.
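The entropy weight method mentioned above can be sketched in its standard textbook form. The indicator values are made up, and the assumption that indicators are already non-negative and comparably scaled is mine; the paper's exact preprocessing is not described in this summary.

```python
import math


def entropy_weights(matrix: list[list[float]]) -> list[float]:
    """Entropy weights for the indicator columns of a papers-by-indicators matrix."""
    n, m = len(matrix), len(matrix[0])
    if n < 2:
        raise ValueError("need at least two rows to compute entropy")
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total <= 0 or min(column) < 0:
            raise ValueError("indicators must be non-negative with a positive sum")
        entropy = 0.0
        for x in column:
            p = x / total
            if p > 0:
                entropy -= p * math.log(p)
        entropy /= math.log(n)  # normalize to [0, 1]
        # Low entropy means the indicator discriminates between papers.
        divergences.append(max(0.0, 1.0 - entropy))
    total_div = sum(divergences)
    if total_div <= 1e-12:
        return [1.0 / m] * m  # no indicator discriminates: fall back to equal weights
    return [d / total_div for d in divergences]


def difficulty_scores(matrix: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted sum of indicators per paper."""
    return [sum(w * x for w, x in zip(weights, row)) for row in matrix]


# Hypothetical indicators: column 0 is identical for both papers, column 1 varies.
data = [[1.0, 1.0], [1.0, 3.0]]
weights = entropy_weights(data)
print(weights)                           # the constant column gets zero weight
print(difficulty_scores(data, weights))
```

An indicator that takes the same value for every paper carries no information for ranking, so the method assigns it zero weight.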

arxiv arXiv cs.CL · 4h ago

Data-Driven Evolution of Library and Information Science Research Methods (1990-2022)

This study analyzes the influence of data-centric research on Library and Information Science by examining methodological evolution from 1990 to 2022. Researchers automatically extracted four key categories of data-driven entities from academic papers: algorithms and models, data resources, software and tools, and metrics. The analysis evaluates trends across three dimensions: temporal characteristics, topic-specific evolution, and cross-method features. Findings identify data resources as the primary driver of methodological changes within the discipline. The research reveals a cyclical pattern characterized by emergence followed by stability or practical application in LIS methods. This perspective highlights how big data advancements have reshaped the field's technical landscape over three decades.

arxiv arXiv cs.CL · 5h ago

iLLaDA: An 8B Masked Diffusion Language Model with Fully Bidirectional Attention

The authors introduce iLLaDA, an 8B-parameter masked diffusion language model trained from scratch using fully bidirectional attention. This approach contrasts with the predominant autoregressive factorization and causal attention used in modern large language models. The model's pre-training scaled to 12 trillion tokens, followed by supervised fine-tuning on a 25-billion-token instruction corpus for 12 epochs. iLLaDA maintains the masked diffusion objective throughout both training phases and employs variable-length generation for efficiency. It also introduces confidence-based scoring to enhance performance on multiple-choice evaluation tasks. Benchmark results show significant improvements over its predecessor, LLaDA, including gains of 21.6 points on BBH and 14.9 points on ARC-Challenge for the base model. The instruction-tuned variant achieved increases of 14.5 points on MATH and 16.5 points on HumanEval. Despite its non-autoregressive nature, iLLaDA remains competitive with Qwen2.5 7B across several metrics.

arxiv arXiv cs.CL · 5h ago

Hybrid-IR: Dual-Path Hybrid Retrieval with Iterative Reasoning for Complex Medical Question Answering

Large language models face challenges with hallucinations and outdated knowledge in biomedical applications, prompting the development of improved retrieval-augmented generation methods. Existing approaches often struggle with fragmented medical knowledge due to reliance on single retrieval paths and static strategies that hinder deep reasoning. To address these limitations, researchers introduced Hybrid-IR, a dual-path framework featuring an iterative retrieve-reason mechanism for complex medical question answering. This system integrates graph-based retrieval to explore structured knowledge alongside dense retrieval for fine-grained semantic matching. The model progressively refines its reasoning trajectory through an iterative loop between retrieval and reasoning steps. Experiments conducted on three widely used medical QA benchmarks demonstrate the effectiveness of this proposed approach.

arxiv arXiv cs.CL · 5h ago

Local Branch Routing: Efficient Trainable Test-Time Scaling for Language Models

The authors introduce Local Branch Routing (LBR), a token-level framework designed to improve language model reasoning through efficient test-time scaling. LBR expands a small local lookahead tree and forwards all sampled branches through the model, using a lightweight router to select the depth-1 subtree for commitment. This approach allows each token decision to utilize evidence from candidate local futures without incurring the computational costs of full solution-level search. The method employs a prune-shift-grow decoding process that preserves discrete branch identities and defines a tractable tree-trajectory likelihood. Consequently, LBR enables end-to-end reinforcement learning with verifiable rewards, jointly optimizing the base model and router under the same likelihood-ratio principle as discrete-token RLVR. Experimental results on synthetic hierarchical-planning tasks demonstrate that post-candidate hidden states provide useful routing evidence. Furthermore, benchmarks in mathematical reasoning show that LBR improves both Pass@1 and Pass@32 metrics compared to discrete chain-of-thought and other baselines.

arxiv arXiv cs.CL · 5h ago

Memory Makes the Difference: Evaluating How Different Memory Roles Shape Conversational Agents

Prior research on memory mechanisms in RAG-based conversational systems has primarily focused on storage and retrieval methods. This study investigates how memories with distinct functional roles influence response quality across varying contexts. The authors present a fine-grained taxonomy of conversational memory to classify retrieved items into specific role types. They also design a user-centric evaluation framework that simulates user perspectives to address limitations in reference-based assessments. Comparative experiments were conducted on long-term datasets using frontier large language models to analyze these effects. Results indicate that clarifying memory enhances factual accuracy and constraint awareness, leading to more correct and personalized responses. Conversely, irrelevant memory was found to reduce topic relevance and degrade constraint awareness capabilities. These findings demonstrate how different memory types can be leveraged to improve personalization in conversational agents.

arxiv arXiv cs.CL · 5h ago

Neural Machine Translation for Low-Resource Tangkhul-English

This study addresses low-resource machine translation for the Tangkhul-English language pair, focusing on a severely under-resourced Tibeto-Burman language with minimal prior NLP infrastructure. The authors present two systems: a primary model based on ByT5-large and a contrastive system using mT5-small, both fine-tuned on 38,336 parallel sentence pairs. Evaluation on a held-out test set of 3,856 sentences shows the ByT5-large system achieving a corpus BLEU score of 39.97 and a chrF++ score of 58.07. Additional metrics include a BERTScore F1 of 0.8104 and a COMET score of 0.7302 using the wmt22-comet-da model. The research highlights orthographic challenges related to Tangkhul's Latin-script diacritics as a specific technical hurdle. Furthermore, the training corpus exhibits domain bias, consisting primarily of biblical texts, stories, and conversational data. Future work aims to improve performance through data diversification and domain adaptation strategies.

arxiv arXiv cs.CL · 5h ago

Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis

Sarashina2.2-TTS is a Japanese-centric LLM-based text-to-speech system designed to address the linguistic challenge of context-dependent kanji polyphony. The model scales training data to approximately 361k hours, utilizing a balanced mix of Japanese and English speech corpora. To specifically handle reading disambiguation, the authors implemented a targeted data augmentation pipeline covering all 2,136 Joyo regular-use kanji. Alongside the model release, the paper introduces the Joyo Kanji Yomi Benchmark, which includes 4,378 distinct readings for these characters. The authors also propose Kana-CER, a metric that evaluates pronunciation correctness by comparing synthesized speech against reference readings in kana space. Experimental results show that this targeted augmentation significantly improves reading accuracy and achieves state-of-the-art kanji-level performance. The system matches top baselines on general sentence-level pronunciation while delivering the highest speaker similarity in zero-shot synthesis scenarios. Furthermore, cross-lingual evaluations confirm that the balanced training approach ensures stable Japanese pronunciation regardless of the prompt language used.
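A character error rate computed over kana strings can be sketched with a plain Levenshtein distance. This assumes the synthesized speech has already been transcribed to kana by some other component, and it normalizes by the reference length; the paper's exact Kana-CER definition and any kana normalization steps are not given in this summary.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, one row at a time."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def kana_cer(reference_kana: str, hypothesis_kana: str) -> float:
    """Edit distance divided by reference length, both in kana."""
    if not reference_kana:
        raise ValueError("reference reading must not be empty")
    return edit_distance(reference_kana, hypothesis_kana) / len(reference_kana)


# Reference reading "きょう" versus a hypothetical transcription "きよう":
# one substituted character out of three.
print(kana_cer("きょう", "きよう"))
```

Comparing in kana rather than in kanji makes a wrong reading visible even when the written form would be identical.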

arxiv arXiv cs.CL · 5h ago

Computational Stylometry of English Pali Canon Translations Across Pitakas

This study presents a computational stylometric analysis of the Tipitaka across all three Pitakas in English translation, extending previous work on the Sutta Pitaka. The corpus comprises 134,831 segments from Bhikkhu Sujato's Sutta Pitaka, Bhikkhu Brahmali's Vinaya Pitaka, I.B. Horner's 1938 Vinaya translation, three English translations of the Abhidhammattha Sangaha, and cross-tradition Vinaya texts. The authors compute Zipf rank-frequency distributions, MATTR-500 lexical diversity, numeral-word density, and vocabulary overlap metrics. Main findings indicate that all corpora show Zipf-consistent distributions with R-squared values above 0.989. The Sutta and Theravada Vinaya exhibit nearly identical lexical diversity scores of 0.399 and 0.400, while the Sangaha corpus is more diverse at 0.560. The Sangaha corpus also displays the highest numeral-word density at 3.26%, reflecting its systematic enumeration of categories. Additionally, the Mulasarvastivada Vinaya shares significant vocabulary overlap with the Theravada Vinaya, whereas two English translations of the same source share only 24.2% of their vocabulary.
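MATTR-500 is the moving-average type-token ratio with a 500-token window. The sketch below follows the usual definition (mean of the type-token ratio over every sliding window); the whitespace-split toy input and the fallback for texts shorter than the window are assumptions, not details taken from the paper.

```python
from collections import Counter


def mattr(tokens: list[str], window: int = 500) -> float:
    """Mean type-token ratio over all sliding windows of the given size."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Assumption: fall back to plain TTR for short texts.
        return len(set(tokens)) / len(tokens)
    counts = Counter(tokens[:window])
    ratio_sum = len(counts) / window
    n_windows = 1
    for i in range(window, len(tokens)):
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]  # keep len(counts) equal to the number of types
        counts[tokens[i]] += 1
        ratio_sum += len(counts) / window
        n_windows += 1
    return ratio_sum / n_windows


# Toy check with window=2: windows are [a,a], [a,b], [b,b] -> (0.5 + 1.0 + 0.5) / 3
print(mattr("a a b b".split(), window=2))
```

Unlike a plain type-token ratio, the windowed average does not shrink as a corpus grows, which is what makes scores such as 0.399 and 0.560 comparable across corpora of different sizes.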

arxiv arXiv cs.CL · 5h ago

Story Operators: Decomposing the Original to Sequel Transformation in Embedding Space

This study models literary transformations as geometric operations within a sentence-embedding space using all-mpnet-base-v2 vectors from the PG19 corpus. By calculating displacement vectors between original novels and their sequels, the author decomposes these changes along a content basis derived via PCA. Analysis of thirteen verified author pairs reveals a taxonomy of sequel types: formulaic, concentrated, and compositional. Formulaic transformations involve minimal rank changes, such as Doyle's Holmes collections with a norm of 0.12. Concentrated shifts are dominated by a single axis, exemplified by Alcott's Little Women to Little Men where 75% of the change occurs on one move. Compositional transformations involve many small axes, seen in works by Twain, Burroughs, and Nesbit. For Tom Sawyer to Huckleberry Finn, the dominant axis is structural, reflecting a shift from domesticity to picaresque adventure rather than surface themes like vernacular voice. The geometric findings are corroborated against Mark Twain's documented authorial intent in letters to Howells.