Training data — korshunov.ai

Topic · Training data

arxiv arXiv cs.AI · 7d ago

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric approach improves long-context reasoning in large language models, using eight curated datasets with 14K examples across retrieval, multi-evidence synthesis, and reasoning tasks. When paired with minimal outcome-based GRPO training, it achieves average gains of +7.2 to +6.4 points on seven benchmarks, outperforming prior RL training sets, and enhances agentic performance by +4.8 and +7.0 points on GAIA and BrowseComp respectively.

arxiv arXiv cs.CL · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural networks systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. This degradation is linked to LLM-alone discriminability (Delta_sig), which correlates strongly with concatenation cost (r² = 0.38) and shows a power-law relationship with feature dimension and node count (r² = 0.97), particularly in low-Delta_sig, low-node scenarios.

media Hugging Face Forums · 3d ago

Seeking Indic Document Datasets for AI/OCR Training in India

QuantVectors is seeking annotated document datasets in Indic languages, including Hindi, Marathi, Gujarati, Bengali, Punjabi, Tamil, Urdu, Telugu, Odia, Kannada, Malayalam, and Assamese. The datasets should cover invoices, receipts, utility bills, payment advices, packing lists, commercial invoices, and credit notes, with approximately 400 documents per language, human-verified annotations, and 99%+ accuracy. They must be commercially licensable, whether open-source or commercial. The poster asks for pointers to Hugging Face datasets, research datasets, or vendors specializing in this space.

arxiv arXiv cs.AI · 6d ago

DataMagic Turns Tabular Data into Interactive Insight Videos

DataMagic transforms raw tabular data and natural language queries into narrative data-insight videos. It uses DVSpec to ensure data fidelity by linking visual elements to data fields via semantic references, and employs a multi-agent architecture to generate and orchestrate coherent video scenes. The system supports interactive exploration and provenance-based data Q&A, enabling users to engage with data beyond static views.

arxiv arXiv cs.LG · 6d ago

Bias Mitigation under Coverage Constraints and the Price of Fairness

A new framework addresses data bias in machine learning by incorporating coverage constraints to ensure sufficient representation of intersectional subgroups. It trades small bias errors for greater data efficiency and formulates bias mitigation as an integer linear program, characterizing the price of fairness as a function of fairness tolerance to guide data governance and legal compliance.

arxiv arXiv cs.CL · 6d ago

TerraMARS: Small Language Model Pipeline for Mars Terraforming Literature

TerraMARS is an end-to-end pipeline that uses a domain-adapted small language model to extract structured information from Mars science literature. It converts unstructured text into JSON format and supports Mars terraforming-related question answering, enabling integration into habitability modeling and digital twin applications. The pipeline uses Google Gemma 3 1B fine-tuned with QLoRA on Mars-specific datasets, though further work is needed to improve accuracy and factual consistency.

media r/LocalLLaMA · 7d ago

Does anyone have enough compute to make a distillation dataset from GLM5.2?

A user asks if anyone with sufficient computing resources can create a large distillation dataset of 70-1 million examples from GLM5.2. The goal is to enable better training of smaller models like Qwen3.5, benefiting the broader community.

arxiv arXiv cs.LG · 7d ago

Automated Annotation Framework for Delayed and False AEB Triggers

A new automated system addresses extreme class imbalance and asymmetric label noise in Autonomous Emergency Braking (AEB) data. It uses targeted data augmentation and noise suppression to identify rare delayed and false triggers, improving recall by 80% and reducing manual annotation effort by 50%, which enables continuous self-improvement in on-vehicle AEB optimization.

arxiv arXiv cs.CL · 7d ago

CDDTLDA: Transfer Learning for Chinese Dialect Discrimination

CDDTLDA is a framework that uses transfer learning and data augmentation to address Chinese dialect discrimination with limited annotations. It trains a source ASR model on a large dialect corpus, applies speed, pitch, and noise augmentation to low-resource target dialects, and fine-tunes a target ASR model using self-attention to capture shared semantic features. Experimental results show CDDTLDA outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.
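As a rough illustration of the speed and noise augmentation mentioned above, here is a minimal standard-library sketch. It is not the paper's pipeline: the function names, the linear-interpolation resampler, and the white Gaussian noise model are all assumptions made for the example. Plain resampling changes speed and pitch together; changing pitch without changing tempo needs a different technique (for example a phase vocoder) and is left out.

```python
import math
import random


def speed_perturb(samples, factor):
    """Resample by linear interpolation so playback is `factor` times faster."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(n_out):
        # Fractional index into the input, clamped to the last sample.
        pos = min(i * factor, len(samples) - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise at approximately the requested SNR in dB."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# Toy input (hypothetical): one second of a 440 Hz tone sampled at 16 kHz.
rate = 16000
tone = [math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]

rng = random.Random(0)  # fixed seed so the noisy copy is reproducible
fast = speed_perturb(tone, 1.1)
noisy = add_noise(tone, snr_db=20.0, rng=rng)
```

Each augmented copy keeps the original label, which is how such perturbations enlarge a low-resource training set.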

arxiv arXiv cs.CL · 7d ago

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

RegMix-D extends RegMix by leveraging full loss trajectories from proxy runs to dynamically select data mixtures. It outperforms RegMix and DoReMi across 13 downstream tasks, achieving superior results with just 128 proxy models—25% of RegMix's compute budget.

arxiv arXiv cs.CL · 7d ago

SAMA: Unified Framework for Low-Resource Multimodal Data Augmentation

SAMA introduces a unified framework that generates high-fidelity, task-aware synthetic data by aligning semantic anchors across modalities. It uses a Collaborative Multi-Experts Multimodal Large Language Model with shared and task-specific adapters, and employs an Anchor-Preserving Diffusion mechanism for image synthesis, ensuring semantic consistency while diversifying visual contexts. Extensive experiments show SAMA outperforms state-of-the-art methods in MNER, MRE, and MEE under low-resource conditions.

arxiv arXiv cs.CL · 7d ago

Distillation with Synthetic Data for Financial Sentiment Analysis

A framework transfers knowledge from large instruction-tuned models to compact ones using synthetic data generated via structured few-shot prompting. Clustering-based seed selection produces more representative synthetic examples than random sampling, enabling compact models to achieve strong performance with minimal human labeling. On complex, noisy financial text, the student model outperforms the teacher model, while remaining competitive on formal text.
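The clustering-based seed selection described above can be sketched as follows. This is an illustration under stated assumptions rather than the paper's method: it assumes each candidate example already has an embedding vector, uses plain k-means with a naive initialization, and picks the example nearest each centroid as the few-shot seed. The names `kmeans`, `select_seeds`, and `embeddings` are hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]

    def nearest(p):
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the final centroids.
    return centroids, [nearest(p) for p in points]


def select_seeds(points, k):
    """Return the index of the point closest to each cluster centroid."""
    centroids, assign = kmeans(points, k)
    seeds = []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            seeds.append(min(members, key=lambda i: math.dist(points[i], centroids[c])))
    return seeds


# Hypothetical 2-D embeddings forming three loose groups.
embeddings = [
    (0.0, 0.0), (5.0, 5.0), (9.0, 0.0),
    (0.2, 0.1), (5.2, 4.9), (9.1, 0.2),
    (0.1, 0.3), (4.9, 5.3), (8.8, 0.1),
]
seeds = select_seeds(embeddings, 3)
```

Choosing one seed per cluster spreads the few-shot prompts across the data, whereas random sampling can draw several seeds from the same dense region.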

media Latent Space · 7d ago

Radical AI Achieves 10x Acceleration in Materials Discovery

Radical AI has accelerated materials discovery by producing and characterizing 1,200 alloys in six months—nearly 10x faster than DARPA/GE MACH's goal of 500 alloys in a year. Their self-driving labs use AI scientists to generate and test hypotheses in closed-loop systems, leading to 300 new materials with 10 exhibiting novel, state-of-the-art properties now being developed for commercial use.

arxiv arXiv cs.LG · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural networks systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. A measure of LLM-alone discriminability, Delta_sig, correlates strongly with concatenation performance (r² = 0.38), and a rule based on Delta_sig <= 13.8 pp correctly predicts non-positive impact in 7 of 9 datasets.
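The threshold rule quoted above amounts to a one-line check. The sketch below only restates it; the function name and the Delta_sig values are hypothetical and are not taken from the paper.

```python
THRESHOLD_PP = 13.8  # threshold quoted in the summary, in percentage points


def predicts_non_positive(delta_sig_pp, threshold_pp=THRESHOLD_PP):
    """Return True when the rule predicts that concatenation will not help."""
    return delta_sig_pp <= threshold_pp


# Hypothetical Delta_sig values for illustration only.
examples = {"dataset_a": 5.2, "dataset_b": 13.8, "dataset_c": 21.0}
decisions = {name: predicts_non_positive(v) for name, v in examples.items()}
```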

arxiv arXiv cs.AI · 8d ago

Stanford EDGAR Filings Dataset Released

Stanford introduces SEFD, an open, layout-faithful reconstruction of SEC filings into MultiMarkdown. The 152B-token SEFD-v1 dataset enables financial language modeling and includes benchmarks for forecasting and table transcription, with less than 0.1% overlap with Common Crawl.

arxiv arXiv cs.AI · 9d ago

FusionRS: First Large-Scale RGB-Infrared Remote Sensing Dataset

FusionRS introduces the first large-scale RGB-infrared-text dataset for remote sensing vision-language modeling. It aligns RGB and infrared images with IR-aware captions, enabling dual-modal vision-language foundation models. Experiments show improved RGB-IR alignment, retrieval, and captioning, with ablation studies confirming the critical role of modality-specific textual supervision.

media r/LocalLLaMA · 5d ago

World's Biggest Chat Title Dataset Released by SupraLabs

SupraLabs has released a curated chat title dataset with 115K samples, surpassing the previous record of 10K samples. The filtered dataset is available as `SupraLabs/chat-titles-filtered-115K`, while an unfiltered version with 150K samples is also provided, along with a legacy 12K dataset.

arxiv arXiv cs.AI · 6d ago

Context-Aware Bayesian Model Improves IVF Success Prediction

A hierarchical Bayesian model using 55 context-aware environmental features reduces prediction error to 1.27% in IVF data, compared to 3-5% with raw sensor averages. The model achieves R² = 0.86 on held-out data and reduces error by 64% for women aged 35-39, showing transferable clinical signal across clinics.

arxiv arXiv cs.LG · 6d ago

Topological Data Analysis for Real-Time Process Monitoring

A new method combines topological data analysis and machine learning to monitor high-dimensional dynamic processes. It represents time-series data as manifolds, uses topological descriptors to capture structure, and employs neural ordinary differential equations to model dynamic evolution. The approach effectively detects diverse events in industrial process data and outperforms reconstruction-based and trajectory-based alternatives.