All articles
arxiv arXiv cs.CL · 12d ago

TerraMARS: Small Language Model Pipeline for Mars Terraforming Literature

TerraMARS is an end-to-end pipeline that uses a domain-adapted small language model to extract structured information from Mars science literature. It converts unstructured text into JSON format and supports Mars terraforming-related question answering, enabling integration into habitability modeling and digital twin applications. The pipeline uses Google Gemma 3 1B fine-tuned with QLoRA on Mars-specific datasets, though further work is needed to improve accuracy and factual consistency.

arxiv arXiv cs.CL · 12d ago

NEST: Dataset for Narrative Event Structures in Long Videos

NEST introduces a dataset of 1005 full-length movies, each annotated with 102 multimodal narrative events grounded in visual, dialogue, and audio content. The dataset captures event relationships such as temporal ordering, hierarchy, and long-range dependencies, with benchmark tasks showing low performance in event detection and localization, and higher performance in event relation extraction after fine-tuning.

arxiv arXiv cs.CL · 12d ago

Sequential DPO Shows Variable Preference Impact Across Settings

A study of sequential Direct Preference Optimization finds that later training does not uniformly degrade earlier learned preferences. The effect varies by objective relationship, signal strength, and training order, ranging from partial degradation to positive transfer. Pair-level analysis reveals heterogeneous changes, with high-confidence preference pairs sometimes improving despite aggregate metric stability.

arxiv arXiv cs.CL · 12d ago

Bayesian Curriculum Learning on LLM Latent Manifolds

Manifold Bandits introduces Bayesian Manifold Curriculum (BMC), a framework that models problem sampling as a structured bandit problem in LLMs' latent space. BMC organizes tasks into a hierarchical tree and uses Bayesian learning to guide sampling, revealing tradeoffs between learning signal, task diversity, and evaluation relevance. Prioritizing difficulty alone fails to achieve strong downstream performance, underscoring the need for structure and type-aware sampling.

arxiv arXiv cs.CL · 12d ago

Selective Verification for Budget-Aware Reasoning

Sevra, a serving-layer controller, selectively verifies answers to improve accuracy and reduce token usage. On \mathfive, it achieves 76.3% accuracy with 26.8% fewer post-generation tokens and halved harmful flips, while on \gsm it verifies only 3.0% of examples, boosting accuracy to 94.5% and cutting verification tokens by 91.2%. The study shows that initial solve length and explicit control needs determine optimal verification strategy.

arxiv arXiv cs.CL · 12d ago

Credence: Semantic Metrics and Convergence Analysis for Claim Decomposition

Credence introduces Semantic-F1, a BGE-large cosine similarity metric that improves claim decomposition accuracy over Jaccard by 15-32 percentage points. It establishes convergence theorems for rule- and LLM-based repair, showing rule-based repair is finitely terminating and monotone, while LLM-based repair requires early-exit guards. Evaluations across social-media, encyclopaedic, and news domains show EPR from 0.94 to 1.00, with rule-repair reducing atomicity violations by 47-100% without fidelity loss.

arxiv arXiv cs.CL · 12d ago

Control-Window Law for Single-Neuron Steering in Language Models

A new framework defines when single-neuron interventions coherently control model behaviors without output collapse. The control window, based on alignment and norm ratios, predicts behavior triggers and collapse ceilings using forward pass data, with high accuracy on held-out neurons. On refusal, control is typed: coherent bypass occurs without actionable content, while genuine actionable reach appears only in specific cases and at later rollout stages.