Topic · Training data
arxiv arXiv cs.CL · 7d ago

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric approach improves long-context reasoning in large language models, using eight curated datasets totaling 14K examples across retrieval, multi-evidence synthesis, and reasoning tasks. Paired with minimal outcome-based GRPO training, it achieves average gains of between +6.4 and +7.2 points on seven benchmarks, outperforming prior RL training sets, and improves agentic performance by +4.8 points on GAIA and +7.0 points on BrowseComp.
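
As a rough illustration of what "outcome-based GRPO" usually refers to, the sketch below scores each sampled answer only on its final outcome and then normalizes rewards within the group of samples drawn for the same prompt. The exact-match reward, the function names, and the use of the population standard deviation are assumptions made for illustration; the summary does not describe the paper's reward or normalization details.

```python
from statistics import mean, pstdev


def outcome_reward(predicted: str, reference: str) -> float:
    """Hypothetical outcome reward: 1.0 for an exact (case-insensitive) match, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so answers
    are compared against their siblings rather than a learned value model.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question whose reference answer is "Paris".
samples = ["Paris", "Lyon", "paris ", "Marseille"]
rewards = [outcome_reward(s, "Paris") for s in samples]
advantages = group_relative_advantages(rewards)
```

When every sample in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason the difficulty mix of the training data matters for this kind of training.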

arxiv arXiv cs.CL · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural network inputs systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. The degradation is linked to LLM-alone discriminability (Delta_sig), which correlates with concatenation cost (r² = 0.38) and follows a power-law relationship with feature dimension and node count (r² = 0.97), particularly in low-Delta_sig, low-node scenarios.

media Hugging Face Forums · 3d ago

Seeking Indic Document Datasets for AI/OCR Training in India

QuantVectors is seeking annotated document datasets in Indic languages from India, including Hindi, Marathi, Gujarati, Bengali, Punjabi, Tamil, Urdu, Telugu, Odia, Kannada, Malayalam, and Assamese. The datasets should cover invoices, receipts, utility bills, payment advices, packing lists, commercial invoices, and credit notes, with approximately 400 documents per language and human-verified annotations at 99%+ accuracy. They may be open-source or commercial but must be commercially licensable. The poster asks for pointers to Hugging Face datasets, research datasets, or vendors specializing in this space.

arxiv arXiv cs.AI · 6d ago

DataMagic Turns Tabular Data into Interactive Insight Videos

DataMagic transforms raw tabular data and natural language queries into narrative data-insight videos. It uses DVSpec to ensure data fidelity by linking visual elements to data fields via semantic references, and employs a multi-agent architecture to generate and orchestrate coherent video scenes. The system supports interactive exploration and provenance-based data Q&A, enabling users to engage with data beyond static views.

arxiv arXiv cs.CL · 6d ago

TerraMARS: Small Language Model Pipeline for Mars Terraforming Literature

TerraMARS is an end-to-end pipeline that uses a domain-adapted small language model to extract structured information from Mars science literature. It converts unstructured text into JSON format and supports Mars terraforming-related question answering, enabling integration into habitability modeling and digital twin applications. The pipeline uses Google Gemma 3 1B fine-tuned with QLoRA on Mars-specific datasets, though further work is needed to improve accuracy and factual consistency.

arxiv arXiv cs.CL · 7d ago

CDDTLDA: Transfer Learning for Chinese Dialect Discrimination

CDDTLDA is a framework that uses transfer learning and data augmentation to address Chinese dialect discrimination with limited annotations. It trains a source ASR model on a large dialect corpus, applies speed, pitch, and noise augmentation to low-resource target dialects, and fine-tunes a target ASR model that uses self-attention to capture shared semantic features. Experimental results show that CDDTLDA outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.
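
To make the augmentation step concrete, here is a minimal sketch of two of the three perturbations named above, speed and noise, applied to a waveform held as a list of floats. The function names, the linear-interpolation resampler, and the SNR-based noise level are illustrative assumptions, not the paper's implementation. Pitch shifting is left out because doing it without changing duration needs a time-stretching step that does not fit in a short example.

```python
import math
import random


def speed_perturb(samples, factor):
    """Resample by linear interpolation; factor > 1 speeds up (shortens) the clip."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor                     # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio in dB."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# One second of a 440 Hz tone at an assumed 16 kHz sample rate.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
fast = speed_perturb(tone, 1.1)
noisy = add_noise(tone, snr_db=20, rng=random.Random(0))
```

Note that resampling this way changes pitch along with speed, which is the usual behavior of simple speed perturbation in ASR training pipelines.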

arxiv arXiv cs.CL · 7d ago

SAMA: Unified Framework for Low-Resource Multimodal Data Augmentation

SAMA introduces a unified framework that generates high-fidelity, task-aware synthetic data by aligning semantic anchors across modalities. It uses a Collaborative Multi-Experts Multimodal Large Language Model with shared and task-specific adapters, and employs an Anchor-Preserving Diffusion mechanism for image synthesis, ensuring semantic consistency while diversifying visual contexts. Extensive experiments show SAMA outperforms state-of-the-art methods in MNER, MRE, and MEE under low-resource conditions.

arxiv arXiv cs.CL · 7d ago

Distillation with Synthetic Data for Financial Sentiment Analysis

A framework transfers knowledge from large instruction-tuned models to compact ones using synthetic data generated via structured few-shot prompting. Clustering-based seed selection produces more representative synthetic examples than random sampling, enabling compact models to achieve strong performance with minimal human labeling. On complex, noisy financial text, the student model outperforms the teacher model, while remaining competitive on formal text.
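
One plausible reading of clustering-based seed selection is sketched below: cluster embeddings of the unlabeled texts, then take the example nearest each cluster center as a few-shot seed. The summary does not name the clustering algorithm or the embedding model, so the plain k-means here, the helper names, and the toy 2-D "embeddings" are all assumptions for illustration.

```python
import random


def _dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means with seeded random initialization; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: _dist2(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def select_seeds(points, k, seed=0):
    """Return sorted indices of the points closest to each k-means centroid."""
    chosen = []
    for c in kmeans(points, k, seed=seed):
        idx = min(range(len(points)), key=lambda i: _dist2(points[i], c))
        if idx not in chosen:
            chosen.append(idx)
    return sorted(chosen)


# Toy 2-D "embeddings" forming two well-separated groups.
embeddings = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = select_seeds(embeddings, k=2)
```

Compared with random sampling, this guarantees that each region of the embedding space contributes a seed, which matches the summary's claim that clustered seeds are more representative.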

arxiv arXiv cs.LG · 6d ago

Topological Data Analysis for Real-Time Process Monitoring

A new method combines topological data analysis and machine learning to monitor high-dimensional dynamic processes. It represents time-series data as manifolds, uses topological descriptors to capture structure, and employs neural ordinary differential equations to model dynamic evolution. The approach effectively detects diverse events in industrial process data and outperforms reconstruction-based and trajectory-based alternatives.