Training data — korshunov.ai

Topic · Training data

arxiv arXiv cs.AI · 7d ago

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric approach improves long-context reasoning in large language models, using eight curated datasets with 14K examples across retrieval, multi-evidence synthesis, and reasoning tasks. When paired with minimal outcome-based GRPO training, it achieves average gains of +7.2 to +6.4 points on seven benchmarks, outperforming prior RL training sets, and enhances agentic performance by +4.8 and +7.0 points on GAIA and BrowseComp respectively.

arxiv arXiv cs.CL · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural networks systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. This degradation is linked to LLM-alone discriminability (Delta_sig), which correlates with concatenation cost (r² = 0.38) and shows a power-law relationship with feature dimension and node count (r² = 0.97), particularly in low-Delta_sig, low-node scenarios.

arxiv arXiv cs.CL · 6d ago

TerraMARS: Small Language Model Pipeline for Mars Terraforming Literature

TerraMARS is an end-to-end pipeline that uses a domain-adapted small language model to extract structured information from Mars science literature. It converts unstructured text into JSON format and supports Mars terraforming-related question answering, enabling integration into habitability modeling and digital twin applications. The pipeline uses Google Gemma 3 1B fine-tuned with QLoRA on Mars-specific datasets, though further work is needed to improve accuracy and factual consistency.

media r/LocalLLaMA · 7d ago

Does anyone have enough compute to make a distillation dataset from GLM5.2?

A user asks if anyone with sufficient computing resources can create a large distillation dataset of 70-1 million examples from GLM5.2. The goal is to enable better training of smaller models like Qwen3.5, benefiting the broader community.

arxiv arXiv cs.LG · 7d ago

Automated Annotation Framework for Delayed and False AEB Triggers

A new automated system addresses extreme class imbalance and asymmetric label noise in Autonomous Emergency Braking (AEB) data. It uses targeted data augmentation and noise suppression to identify rare delayed and false triggers, improving recall by 80% and reducing manual annotation effort by 50%, which enables continuous self-improvement in on-vehicle AEB optimization.

arxiv arXiv cs.CL · 7d ago

CDDTLDA: Transfer Learning for Chinese Dialect Discrimination

A framework named CDDTLDA uses transfer learning and data augmentation to address Chinese dialect discrimination with limited annotations. It trains a source ASR model on a large dialect corpus, applies speed, pitch, and noise augmentation to low-resource target dialects, and fine-tunes a target ASR model using self-attention to capture shared semantic features. Experimental results show CDDTLDA outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.
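
For readers unfamiliar with the augmentation step, here is a minimal sketch of two of the three perturbations named above (speed and noise), written in plain Python on a list of audio samples. It is an illustration only, not the CDDTLDA implementation: the function names, the linear-interpolation resampler, and the SNR-based noise level are all assumptions. Pitch shifting is left out because a faithful version needs a time-stretching step; note that this naive speed change also shifts pitch.

```python
import math
import random


def speed_perturb(samples, factor):
    """Resample with linear interpolation; factor > 1 shortens (speeds up) the clip."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor                 # fractional read position in the input
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio in dB."""
    if not samples:
        return []
    power = sum(x * x for x in samples) / len(samples)
    noise_std = math.sqrt(power / (10.0 ** (snr_db / 10.0)))
    return [x + rng.gauss(0.0, noise_std) for x in samples]


# Hypothetical input: a 5 Hz sine sampled at 1 kHz for one second.
tone = [math.sin(2.0 * math.pi * 5.0 * i / 1000.0) for i in range(1000)]
fast = speed_perturb(tone, 1.1)                    # about 10% faster
noisy = add_noise(tone, 20.0, random.Random(0))    # roughly 20 dB SNR
```

In practice each low-resource utterance would be passed through several such perturbations with different parameters to multiply the amount of training audio.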

arxiv arXiv cs.CL · 7d ago

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

RegMix-D extends RegMix by leveraging full loss trajectories from proxy runs to dynamically select data mixtures. It outperforms RegMix and DoReMi across 13 downstream tasks, achieving superior results with just 128 proxy models, or 25% of RegMix's compute budget.

arxiv arXiv cs.CL · 7d ago

SAMA: Unified Framework for Low-Resource Multimodal Data Augmentation

SAMA introduces a unified framework that generates high-fidelity, task-aware synthetic data by aligning semantic anchors across modalities. It uses a Collaborative Multi-Experts Multimodal Large Language Model with shared and task-specific adapters, and employs an Anchor-Preserving Diffusion mechanism for image synthesis, ensuring semantic consistency while diversifying visual contexts. Extensive experiments show SAMA outperforms state-of-the-art methods in MNER, MRE, and MEE under low-resource conditions.

arxiv arXiv cs.CL · 7d ago

Distillation with Synthetic Data for Financial Sentiment Analysis

A framework transfers knowledge from large instruction-tuned models to compact ones using synthetic data generated via structured few-shot prompting. Clustering-based seed selection produces more representative synthetic examples than random sampling, enabling compact models to achieve strong performance with minimal human labeling. On complex, noisy financial text, the student model outperforms the teacher model, while remaining competitive on formal text.
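
The clustering-based seed selection described above can be sketched as follows. This is a hedged illustration, not the paper's method: it assumes each candidate example has already been embedded as a numeric vector, uses a small hand-written k-means with farthest-point initialization (chosen here for determinism), and picks the example nearest each centroid as a few-shot seed. All names are hypothetical.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Tiny k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        far = max(points, key=lambda p: min(_dist2(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: _dist2(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                dim = len(members[0])
                centroids[j] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids


def select_seeds(points, k):
    """Return indices of the examples closest to each cluster centroid."""
    chosen = []
    for c in kmeans(points, k):
        idx = min(range(len(points)), key=lambda i: _dist2(points[i], c))
        if idx not in chosen:
            chosen.append(idx)
    return chosen


# Hypothetical 2-D embeddings forming three well-separated groups of four.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1), (10.1, 0.1),
]
seeds = select_seeds(embeddings, 3)
```

The point of selecting one seed per cluster is coverage: random sampling can draw several seeds from the same dense region, while this approach forces the few-shot prompt to span distinct regions of the data.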

media Latent Space · 8d ago

Radical AI Achieves 10x Acceleration in Materials Discovery

Radical AI has accelerated materials discovery by producing and characterizing 1,200 alloys in six months, nearly 10x faster than DARPA/GE MACH's goal of 500 alloys in a year. Their self-driving labs use AI scientists to generate and test hypotheses in closed-loop systems, leading to 300 new materials with 10 exhibiting novel, state-of-the-art properties now being developed for commercial use.

arxiv arXiv cs.LG · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural networks systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. A measure of LLM-alone discriminability, Delta_sig, correlates with concatenation performance (r² = 0.38), and a rule based on Delta_sig ≤ 13.8 pp correctly predicts non-positive impact in 7 of 9 datasets.
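
The threshold rule quoted above is simple enough to write down directly. The sketch below assumes Delta_sig has already been computed in percentage points (the summary does not say how) and that "non-positive impact" means an accuracy change of zero or less after concatenation. The function names and the toy records are hypothetical placeholders, not values from the paper.

```python
THRESHOLD_PP = 13.8  # threshold quoted in the summary above


def predicts_non_positive(delta_sig_pp, threshold_pp=THRESHOLD_PP):
    """Rule: concatenation is predicted not to help when Delta_sig <= threshold."""
    return delta_sig_pp <= threshold_pp


def rule_accuracy(records, threshold_pp=THRESHOLD_PP):
    """Fraction of (delta_sig_pp, observed_gain_pp) pairs the rule gets right."""
    hits = sum(
        predicts_non_positive(delta, threshold_pp) == (gain <= 0.0)
        for delta, gain in records
    )
    return hits / len(records)


# Placeholder (Delta_sig, observed gain) pairs, invented purely for illustration.
toy_records = [(5.0, -2.0), (13.8, 0.0), (20.0, 1.5), (10.0, 0.7)]
toy_accuracy = rule_accuracy(toy_records)  # 3 of 4 toy cases match the rule
```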

arxiv arXiv cs.AI · 8d ago

Stanford EDGAR Filings Dataset Released

Stanford introduces SEFD, an open, layout-faithful reconstruction of SEC filings into MultiMarkdown. The 152B-token SEFD-v1 dataset enables financial language modeling and includes benchmarks for forecasting and table transcription, with less than 0.1% overlap with Common Crawl.

arxiv arXiv cs.AI · 9d ago

FusionRS: First Large-Scale RGB-Infrared Remote Sensing Dataset

FusionRS introduces the first large-scale RGB-infrared-text dataset for remote sensing vision-language modeling. It aligns RGB and infrared images with IR-aware captions, enabling dual-modal vision-language foundation models. Experiments show improved RGB-IR alignment, retrieval, and captioning, with ablation studies confirming the critical role of modality-specific textual supervision.

arxiv arXiv cs.AI · 6d ago

EEG Foundation Models for Burst-Suppression Detection in ICU

A study evaluates EEG Foundation Models for event-based burst-suppression detection in ICU settings without patient-specific calibration. REVE-base achieved the highest event-based F1-score of 0.868 and reduced burst-per-minute error by 52.1% compared to EEGNet and 36.2% compared to adaptive thresholding. Ablation results show full fine-tuning outperforms other strategies, and pretrained REVE-base surpasses random initialization by 0.723 F1 points at 25% labeled data, highlighting the value of pretraining for limited datasets.

arxiv arXiv cs.AI · 6d ago

Residual-Space Evolutionary Optimization via Flow-based Generative Models

A model-agnostic framework combines flow-based generative editing with evolutionary algorithms to enable data editing in non-differentiable settings. It operates in residual space, using self-pollination for local refinement and cross-pollination for broad exploration, validated on MorphoMNIST and crystal data to balance target alignment, instance preservation, and diversity.

arxiv arXiv cs.AI · 6d ago

Learner-based Concept Drift Detection: Analysis and Evaluation

This study analyzes and evaluates concept drift detection algorithms across various categories using synthetic and real-world streaming datasets. It examines drift characteristics and evaluates detector performance under abrupt and gradual drift scenarios to improve understanding of drift behavior and detector applicability.

arxiv arXiv cs.AI · 6d ago

Novel DTL Approach for Data-Scarce Fault Diagnosis

A new deep transfer learning method leverages systems' non-linearities to generate diagnostic data under severe data scarcity. This approach uses a periodic multi-excitation procedure and a novel data visualization technique to augment limited vibration data, enabling effective fault diagnosis via pre-trained CNNs. Experimental results on a railway pantograph validate the method's effectiveness.

arxiv arXiv cs.LG · 6d ago

Self-Adaptive Scale Handling for Time Series Forecasting

A new module called Self-Adaptive Scale-handling (AS) addresses scale heterogeneity in time series forecasting. It uses Scale Calibrating and Scaling Selection to adaptively adjust scaling factors, preserving semantic discriminability and reducing inverse-scaling errors. Experiments on fund sales data show improved performance when integrated into existing forecasting models.

arxiv arXiv cs.LG · 6d ago

TESSERA and AlphaEarth Embeddings Enable Fine-scale LCZ Mapping in Swiss Cities

A study across five Swiss cities compares TESSERA and AlphaEarth embeddings with traditional Sentinel data to upscale Local Climate Zone maps to 10-meter resolution using an attention-based U-Net. TESSERA consistently outperforms both Sentinel-1/2 and AlphaEarth, achieving IoU scores of 0.59–0.69 and 0.77–0.82. The results show embeddings reduce manual preprocessing and support scalable, reproducible LCZ mapping, though improved reference data is key for further accuracy gains.
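
For reference, the IoU (intersection over union) metric cited above is computed per class as the number of pixels where prediction and reference both carry that class, divided by the number where either does. A minimal sketch on flattened label maps follows. Whether the study reports per-class or mean IoU is not stated in the summary, so both are shown, and the sample labels are made up for illustration.

```python
def class_iou(pred, ref, cls):
    """IoU for one class over two equally long label sequences."""
    inter = sum(1 for p, r in zip(pred, ref) if p == cls and r == cls)
    union = sum(1 for p, r in zip(pred, ref) if p == cls or r == cls)
    return inter / union if union else float("nan")


def mean_iou(pred, ref):
    """Average IoU over every class that appears in pred or ref."""
    classes = sorted(set(pred) | set(ref))
    return sum(class_iou(pred, ref, c) for c in classes) / len(classes)


# Hypothetical flattened LCZ label maps (class ids 1 to 3).
predicted = [1, 1, 2, 2, 3, 3]
reference = [1, 2, 2, 2, 3, 1]
per_class = {c: class_iou(predicted, reference, c) for c in (1, 2, 3)}
overall = mean_iou(predicted, reference)
```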