arXiv cs.CL · 10h ago

Majority Vote Silences Minority Values: Annotator Disagreement at the Hate/Offensive Boundary in HateXplain

The study shows that collapsing annotator disagreement into majority-vote labels during hate speech annotation is not neutral: 42.6% of all disagreement concentrates at the hate/offensive boundary. This pattern indicates that annotators apply different thresholds for where hate begins, which makes the problem structural to how ground truth is defined rather than random noise.
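
To make the summary's point concrete, here is a minimal sketch of how a majority vote discards disagreement and how the share of disagreement at one label boundary could be measured. The summary does not say how the paper counts disagreement; counting disagreeing annotator pairs is an assumption made here, and the function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_label(labels):
    """Return the most frequent label, or None when the top two labels tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def boundary_share(annotations, boundary=("hate", "offensive")):
    """Fraction of disagreeing annotator pairs that split across `boundary`.

    Assumption: disagreement is counted per pair of annotators on the same
    post. The paper may define it differently.
    """
    target = frozenset(boundary)
    disagreeing = 0
    at_boundary = 0
    for labels in annotations:
        for a, b in combinations(labels, 2):
            if a != b:
                disagreeing += 1
                if frozenset((a, b)) == target:
                    at_boundary += 1
    return at_boundary / disagreeing if disagreeing else 0.0


# Toy data (made up for illustration): three annotators per post.
toy = [
    ["hate", "hate", "offensive"],
    ["offensive", "normal", "offensive"],
    ["hate", "hate", "hate"],
]

print([majority_label(post) for post in toy])
print(boundary_share(toy))
```

In the toy data the first post is labeled "hate" by majority vote even though one annotator placed it on the other side of the hate/offensive boundary, which is the kind of information the aggregated label no longer carries.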

arXiv cs.CL · 10h ago

Structure-Preserving Document Translation via Multi-Stage LLM Pipeline: A Case Study in Marathi

This paper presents a framework for translating Marathi government documents to English that maintains layout fidelity and structural integrity, addressing limitations of existing systems that neglect formatting. The system integrates layout-aware OCR, coordinate-based text extraction, LLM translation, and HTML reconstruction to ensure spatial alignment and hierarchical consistency.
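
As a rough illustration of the last stage described above, the sketch below rebuilds an HTML page from text boxes that carry their original coordinates, so translated text lands where the source text was. It is not the paper's implementation: the `TextBox` type, the `translate` stub standing in for the LLM stage, and the absolute-positioning strategy are all assumptions.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class TextBox:
    """One extracted text region; coordinates are in pixels (assumed)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def translate(text: str) -> str:
    """Hypothetical stand-in for the LLM translation stage; returns input unchanged."""
    return text


def boxes_to_html(boxes, page_width, page_height):
    """Emit one absolutely positioned <div> per box, in reading order."""
    parts = [
        f'<div style="position:relative;width:{page_width}px;height:{page_height}px">'
    ]
    # Sort top-to-bottom, then left-to-right, to approximate reading order.
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        style = (
            f"position:absolute;left:{box.x}px;top:{box.y}px;"
            f"width:{box.width}px;height:{box.height}px"
        )
        parts.append(f'<div style="{style}">{escape(translate(box.text))}</div>')
    parts.append("</div>")
    return "\n".join(parts)


page = boxes_to_html(
    [TextBox("Body text", 40, 120, 500, 16), TextBox("Title & date", 40, 30, 500, 24)],
    page_width=600,
    page_height=800,
)
print(page)
```

Absolute positioning preserves spatial alignment but not document hierarchy; a pipeline that also needs headings and tables would have to map boxes to semantic elements, which this sketch does not attempt.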

arXiv cs.CL · 11h ago

The Heterogeneous Safety Impacts of Benign Multilingual Fine-Tuning

A comprehensive empirical study reveals that fine-tuning large language models with benign multilingual data significantly increases their tendency to comply with unsafe adversarial prompts, a phenomenon termed multilingual safety drift. The research demonstrates that safety outcomes are highly sensitive to both the language used for fine-tuning and the language of evaluation, with compliance rates increasing four-fold in certain settings.

arXiv cs.CL · 11h ago

Memory-Managed Long-Context Attention: A Preliminary Study of Editable Request-Local Memory

This study investigates memory-managed long-context attention by separating a fast recurrent or sparse backbone from explicit editable request-local memory slots and query-time sparse fallback. The research aims to address the limitations of existing linear, recurrent, and sparse attention methods in managing when facts should be written, overwritten, protected, or discarded.

arXiv cs.CL · 11h ago

PASTA: A Paraphrasing And Self-Training Approach for Knowledge Updating in LLMs

This paper introduces PASTA, a framework designed to integrate detailed factual information from news articles into Large Language Models (LLMs) to address the challenge of knowledge updating. The approach combines data augmentation, question-answering generation, and a novel self-learning Direct Preference Optimization (DPO) process to enable knowledge overwriting and hallucination suppression.

arXiv cs.CL · 12h ago

FinInvest-GTCN: Explainable Graph-Temporal-Causal Modeling for Risk-Aware Investment Decision Optimization

Researchers introduce FinInvest-GTCN, a Graph-Temporal-Causal Network designed to optimize venture capital investment decisions by addressing challenges like heterogeneous data and non-stationary time series. The model redefines the task from content recommendation to quantitative risk-return assessment, utilizing a relational graph encoder, multi-scale temporal fusion, and a causal decision head to generate interpretable predictions.

arXiv cs.CL · 12h ago

EVLA: An Electro-Aware Multimodal Assistant for Physically-Grounded Driving Reasoning and Control

The authors introduce the Electro-Visual-Language Assistant (EVLA), a framework that integrates multi-modal scene understanding with real-time perception of an electrified powertrain's electro-mechanical state to improve driving decisions. This approach addresses the limitation of existing vision-language models that treat vehicle dynamics as a black box by incorporating physical constraints and optimization objectives.