Topic · Training data
arXiv cs.CL · 7d ago

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric recipe improves long-context reasoning in large language models, using eight curated datasets with 14K examples across retrieval, multi-evidence synthesis, and reasoning tasks. When paired with minimal outcome-based GRPO training, it achieves average gains of between +6.4 and +7.2 points on seven benchmarks, outperforming prior RL training sets, and improves agentic performance by 4.8 and 7.0 points on GAIA and BrowseComp, respectively.
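
For readers unfamiliar with outcome-based GRPO, the core idea can be sketched as follows: several responses are sampled for the same prompt, each gets a single reward for its final answer, and each reward is normalized against the group's mean and standard deviation. This is a minimal illustration, not the paper's training code; the function name, the `eps` term, the use of the population standard deviation, and the 0/1 rewards are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    The eps term (an assumption here) avoids division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one long-context question,
# scored 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct answers end up with positive advantages and incorrect ones with negative advantages; a group where every answer scores the same contributes no learning signal.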

arXiv cs.CL · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural networks systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. This degradation is linked to LLM-alone discriminability (Delta_sig), which correlates with concatenation cost (r² = 0.38) and shows a power-law relationship with feature dimension and node count (r² = 0.97), particularly in low-Delta_sig, low-node scenarios.

arXiv cs.CL · 7d ago

CDDTLDA: Transfer Learning for Chinese Dialect Discrimination

A novel framework named CDDTLDA uses transfer learning and data augmentation to address Chinese dialect discrimination with limited annotations. It trains a source ASR model on a large dialect corpus, applies speed, pitch, and noise augmentation to low-resource target dialects, and fine-tunes a target ASR model using self-attention to capture shared semantic features. Experimental results show CDDTLDA outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.
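
The speed and noise augmentations mentioned above can be illustrated with a small standard-library sketch. It is not CDDTLDA's pipeline: the function names, the linear-interpolation resampler, the SNR parameter, and the synthetic test tone are all assumptions for the example, and a real system would typically use an audio library. Pitch augmentation is left out; note that this naive resampling also shifts pitch as a side effect.

```python
import math
import random


def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 gives a shorter, faster clip."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                      # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)    # clamp at the last sample
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


def add_noise(samples, snr_db, rng):
    """Add Gaussian noise scaled to a target signal-to-noise ratio in dB."""
    power = sum(s * s for s in samples) / len(samples)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    return [s + rng.gauss(0.0, noise_std) for s in samples]


# Hypothetical input: 0.1 s of a 220 Hz tone at 16 kHz standing in for speech.
tone = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
rng = random.Random(0)

fast = change_speed(tone, 1.1)    # about 10% faster
slow = change_speed(tone, 0.9)    # about 10% slower
noisy = add_noise(tone, snr_db=20, rng=rng)
print(len(tone), len(fast), len(slow), len(noisy))
```

Each source utterance yields several perturbed variants, which is how augmentation stretches a small annotated target-dialect set.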

arXiv cs.CL · 7d ago

SAMA: Unified Framework for Low-Resource Multimodal Data Augmentation

SAMA introduces a unified framework that generates high-fidelity, task-aware synthetic data by aligning semantic anchors across modalities. It uses a Collaborative Multi-Experts Multimodal Large Language Model with shared and task-specific adapters, and employs an Anchor-Preserving Diffusion mechanism for image synthesis, ensuring semantic consistency while diversifying visual contexts. Extensive experiments show SAMA outperforms state-of-the-art methods in MNER, MRE, and MEE under low-resource conditions.

arXiv cs.CL · 7d ago

Distillation with Synthetic Data for Financial Sentiment Analysis

A framework transfers knowledge from large instruction-tuned models to compact ones using synthetic data generated via structured few-shot prompting. Clustering-based seed selection produces more representative synthetic examples than random sampling, enabling compact models to achieve strong performance with minimal human labeling. On complex, noisy financial text, the student model outperforms the teacher model, while remaining competitive on formal text.
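
One plausible reading of clustering-based seed selection is sketched below: cluster the embeddings of the available examples, then take the example nearest each cluster centre as a few-shot seed, so the seeds cover distinct regions of the data rather than whatever random sampling happens to pick. The paper's actual clustering method and embeddings are not stated in this summary; the small k-means routine, the farthest-point initialization, the 2-D vectors, and the headlines are all invented for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=10):
    """Plain k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centre: the point farthest from all centres chosen so far.
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centre if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def select_seeds(texts, embeddings, k):
    """Return the text closest to each cluster centre."""
    seeds = []
    for c in kmeans(embeddings, k):
        i = min(range(len(texts)), key=lambda n: dist(embeddings[n], c))
        seeds.append(texts[i])
    return seeds


# Toy data: invented headlines with made-up 2-D "embeddings".
texts = [
    "Shares jump after upbeat guidance",
    "Stock rallies on record revenue",
    "Investors cheer buyback plan",
    "Profit warning sinks the stock",
    "Earnings miss triggers sell-off",
    "Company cuts its full-year outlook",
]
embeddings = [(0.9, 0.1), (1.0, 0.2), (0.8, 0.2),
              (0.1, 0.9), (0.2, 1.0), (0.15, 0.95)]

seeds = select_seeds(texts, embeddings, k=2)
print(seeds)
```

The selected seeds would then be placed in the few-shot prompt given to the teacher model that generates the synthetic training examples.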

arXiv cs.CL · 7d ago

LOCUS: A Local Ordinance Corpus for the United States

LOCUS provides machine-readable access to nearly all publicly available U.S. municipal and county ordinance codes, covering 9,239 cities and counties. It includes a county-harmonized access layer for 2,309 of 3,144 U.S. counties, serving the majority of the population. The corpus, built with OCR and metadata for reproducibility, enables large-scale analysis of local law, including dimensions like opacity and paternalism, using ModernBERT-based models.

arXiv cs.LG · 7d ago

Context-Aware Follow-Up Optimization for Type 2 Diabetes

A study uses a Contextual Markov Decision Process to optimize follow-up intervals for Type 2 Diabetes patients based on EHR data from 22,154 patients. The model identifies two clinical contexts—low and high risk—and recommends adaptive intervals: 1 month for unmeasured lab values, up to 3 months for elevated values or hospitalizations, and 6–12 months for stable control, with shorter intervals for high-risk patients. The CMDP policies reduced expected cumulative costs by 34.8% in high-comorbidity and 6.4% in low-comorbidity contexts compared to a fixed interval policy.