arXiv cs.CL · 5h ago

The Geometry of Updates: Fisher Alignment at Vocabulary Scale

This article addresses the challenge of training-free source selection for large language models with shared vocabularies in scientific domains like SMILES and genomics, where classical metrics are either uninformative or computationally prohibitive. The authors demonstrate that representation similarity metrics are non-identifiable for transfer because models can share identical representations yet have orthogonal head updates.
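The non-identifiability claim can be illustrated with a toy example. The sketch below is not the authors' method: it assumes a linear output head trained with squared-error loss (a stand-in chosen for simplicity), and the names `head_gradient` and `frob_inner` are hypothetical. Two tasks see the same representation `h`, so any representation-similarity score between them is maximal, yet their head gradients have zero inner product.

```python
# Toy illustration (assumption: linear head, squared-error loss).
# Same representation h for both tasks, but orthogonal head updates.

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def frob_inner(a, b):
    """Frobenius inner product of two equally shaped matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def head_gradient(W, h, y):
    """Gradient of 0.5 * ||W h - y||^2 with respect to W: (W h - y) h^T."""
    pred = [sum(w * x for w, x in zip(row, h)) for row in W]
    residual = [p - t for p, t in zip(pred, y)]
    return outer(residual, h)

h = [1.0, 2.0]                      # shared representation (identical for both tasks)
W = [[0.0, 0.0] for _ in range(3)]  # head over a 3-token toy vocabulary
y_a = [1.0, 0.0, 0.0]               # task A targets token 0
y_b = [0.0, 1.0, 0.0]               # task B targets token 1

g_a = head_gradient(W, h, y_a)
g_b = head_gradient(W, h, y_b)

alignment = frob_inner(g_a, g_b)    # inner product between the two head updates
norm_sq_a = frob_inner(g_a, g_a)    # squared norm of task A's update
print(alignment, norm_sq_a)
```

Because the two gradients touch disjoint rows of the head, their inner product vanishes even though the inputs to the head are the same vector. A similarity score computed on `h` alone cannot tell these two tasks apart.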

arXiv cs.CL · 5h ago

Beyond Surface Forms: A Comprehensive, Mechanism-Oriented Taxonomy of Indirect Linguistic Encoding for LLM-Based Coded Language Detection

Researchers propose a mechanism-oriented taxonomy of indirect linguistic expressions (ILE) to categorize the underlying operations used to encode and recover meaning in coded language. This approach abstracts away from communicative goals to focus on the specific encoding mechanisms found in algospeak, euphemisms, and adversarial obfuscation.

arXiv cs.CL · 5h ago

LLM-Based Examination of Eligibility Criteria from Securities Prospectuses at the German Central Bank

This paper presents the first case study applying Large Language Models to the German Central Bank's process of verifying securities eligibility for collateral, shifting from traditional Named Entity Recognition to a generative Information Extraction pipeline. The approach decomposes the task into extraction, normalization, and interpretation to handle noisy text and bilingual content more effectively.

arXiv cs.CL · 5h ago

Assessing Post-Reform Changes in Risk Disclosure Quality with a Multidimensional Text Analysis Approach

This study proposes a longitudinal text analysis framework combining Japanese-language NLP metric extraction with paired testing and shift function analysis to evaluate qualitative changes in corporate risk disclosures. Applied to Japan's 2019 disclosure reforms, the approach analyzes 19,770 firm-year observations over ten years to capture multidimensional dynamics often masked by single-indicator methods.

arXiv cs.CL · 6h ago

Mapping Political-Elite Networks in Europe with a Multilingual Joint Entity-Relation Extraction Pipeline

Researchers present a modular, fully open-weight pipeline for multilingual joint entity-relation extraction that builds signed, temporal knowledge graphs from massive unstructured news corpora. The system combines span-based named-entity recognition with a linking cascade to Wikidata and an ontology-constrained mixture-of-experts model to extract directed relationships.

arXiv cs.CL · 6h ago

Nemotron-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

NVIDIA introduces Nemotron-TwoTower, a diffusion language model that decouples context representation and iterative denoising into two separate networks to overcome capacity limitations in existing approaches. Built on the open-weight Nemotron-3-Nano-30B-A3B model and trained on 2.1T tokens, it retains 98.7% of the autoregressive baseline's quality while achieving 2.42X higher wall-clock generation throughput.

arXiv cs.CL · 6h ago

MemStrata: Eliminating Stale-Fact Errors in RAG Agents via Temporal Validity

The article introduces MemStrata, a retrieval memory system designed to eliminate stale-fact errors in AI agents by maintaining temporal validity within accumulated knowledge. Unlike standard Retrieval-Augmented Generation (RAG), which struggles to distinguish between duplicated and contradicted facts due to embedding similarity, MemStrata uses a deterministic supersession rule to retire outdated information.
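The summary does not spell out the supersession rule, so the following is only a minimal sketch of what a deterministic rule of this kind could look like. Everything in it is an assumption made for illustration: the `Fact` and `TemporalMemory` names, the keying of facts by `(subject, attribute)`, and the requirement that timestamps arrive in non-decreasing order. A repeated value is treated as a duplicate and ignored, while a different value for the same key retires the older fact by closing its validity interval.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    valid_from: int
    valid_to: Optional[int] = None  # None means the fact is still current

class TemporalMemory:
    """Hypothetical store; assumes add() is called in timestamp order."""

    def __init__(self) -> None:
        self.facts: List[Fact] = []                      # full history, including retired facts
        self._current: Dict[Tuple[str, str], Fact] = {}  # one live fact per key

    def add(self, subject: str, attribute: str, value: str, timestamp: int) -> Fact:
        key = (subject, attribute)
        current = self._current.get(key)
        if current is not None:
            if current.value == value:
                return current               # duplicate: keep the existing fact
            current.valid_to = timestamp     # contradiction: retire the old fact
        fact = Fact(subject, attribute, value, valid_from=timestamp)
        self.facts.append(fact)
        self._current[key] = fact
        return fact

    def lookup(self, subject: str, attribute: str) -> Optional[str]:
        """Return the currently valid value, or None if nothing is known."""
        fact = self._current.get((subject, attribute))
        return fact.value if fact is not None else None

memory = TemporalMemory()
memory.add("alice", "employer", "Acme", timestamp=1)
memory.add("alice", "employer", "Acme", timestamp=2)    # duplicate, ignored
memory.add("alice", "employer", "Globex", timestamp=3)  # supersedes "Acme"
print(memory.lookup("alice", "employer"))
```

The decision here depends only on key equality and value equality, not on embedding similarity, which is the distinction the summary draws against standard RAG.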