Training data — korshunov.ai

Topic · Training data

arxiv arXiv cs.AI · 7d ago

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric approach improves long-context reasoning in large language models, using eight curated datasets with 14K examples across retrieval, multi-evidence synthesis, and reasoning tasks. When paired with minimal outcome-based GRPO training, it achieves average gains of +7.2 to +6.4 points on seven benchmarks, outperforming prior RL training sets, and enhances agentic performance by +4.8 and +7.0 points on GAIA and BrowseComp respectively.

arxiv arXiv cs.CL · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural networks systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. This degradation is linked to LLM-alone discriminability (Delta_sig), which correlates with concatenation cost (r² = 0.38) and shows a power-law relationship with feature dimension and node count (r² = 0.97), particularly in low-Delta_sig, low-node scenarios.

arxiv arXiv cs.CL · 7d ago

CDDTLDA: Transfer Learning for Chinese Dialect Discrimination

A novel framework named CDDTLDA uses transfer learning and data augmentation to address Chinese dialect discrimination with limited annotations. It trains a source ASR model on a large dialect corpus, applies speed, pitch, and noise augmentation to low-resource target dialects, and fine-tunes a target ASR model using self-attention to capture shared semantic features. Experimental results show CDDTLDA outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.

arxiv arXiv cs.CL · 7d ago

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

RegMix-D extends RegMix by leveraging full loss trajectories from proxy runs to dynamically select data mixtures. It outperforms RegMix and DoReMi across 13 downstream tasks, achieving superior results with just 128 proxy models—25% of RegMix's compute budget.

arxiv arXiv cs.CL · 7d ago

SAMA: Unified Framework for Low-Resource Multimodal Data Augmentation

SAMA introduces a unified framework that generates high-fidelity, task-aware synthetic data by aligning semantic anchors across modalities. It uses a Collaborative Multi-Experts Multimodal Large Language Model with shared and task-specific adapters, and employs an Anchor-Preserving Diffusion mechanism for image synthesis, ensuring semantic consistency while diversifying visual contexts. Extensive experiments show SAMA outperforms state-of-the-art methods in MNER, MRE, and MEE under low-resource conditions.

arxiv arXiv cs.CL · 7d ago

Distillation with Synthetic Data for Financial Sentiment Analysis

A framework transfers knowledge from large instruction-tuned models to compact ones using synthetic data generated via structured few-shot prompting. Clustering-based seed selection produces more representative synthetic examples than random sampling, enabling compact models to achieve strong performance with minimal human labeling. On complex, noisy financial text, the student model outperforms the teacher model, while remaining competitive on formal text.

media Latent Space · 8d ago

Radical AI Achieves 10x Acceleration in Materials Discovery

Radical AI has accelerated materials discovery by producing and characterizing 1,200 alloys in six months—nearly 10x faster than DARPA/GE MACH's goal of 500 alloys in a year. Their self-driving labs use AI scientists to generate and test hypotheses in closed-loop systems, leading to 300 new materials with 10 exhibiting novel, state-of-the-art properties now being developed for commercial use.

arxiv arXiv cs.LG · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural networks systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. A measure of LLM-alone discriminability, Delta_sig, correlates with concatenation performance (r² = 0.38), and a rule based on Delta_sig ≤ 13.8 pp correctly predicts non-positive impact in 7 out of 9 datasets.
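The decision rule quoted in this summary is a single threshold comparison, which can be sketched as below. This is a minimal illustration, not the paper's code: the function name, the constant name, and the sample Delta_sig values are hypothetical, and only the 13.8 pp threshold comes from the summary.

```python
# Threshold taken from the summary above; everything else is illustrative.
DELTA_SIG_THRESHOLD_PP = 13.8


def predicts_non_positive_impact(
    delta_sig_pp: float, threshold_pp: float = DELTA_SIG_THRESHOLD_PP
) -> bool:
    """Return True when the rule predicts that concatenation will not help.

    delta_sig_pp is the LLM-alone discriminability in percentage points.
    """
    return delta_sig_pp <= threshold_pp


if __name__ == "__main__":
    # Made-up Delta_sig values, chosen only to show both sides of the threshold.
    for value in (5.0, 13.8, 20.0):
        print(value, predicts_non_positive_impact(value))
```

Per the summary, this rule matches the observed outcome on 7 of 9 datasets, so it is better read as a screening heuristic than as a guarantee.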

arxiv arXiv cs.AI · 8d ago

Stanford EDGAR Filings Dataset Released

Stanford introduces SEFD, an open, layout-faithful reconstruction of SEC filings into MultiMarkdown. The 152B-token SEFD-v1 dataset enables financial language modeling and includes benchmarks for forecasting and table transcription, with less than 0.1% overlap to Common Crawl.

arxiv arXiv cs.AI · 9d ago

FusionRS: First Large-Scale RGB-Infrared Remote Sensing Dataset

FusionRS introduces the first large-scale RGB-infrared-text dataset for remote sensing vision-language modeling. It aligns RGB and infrared images with IR-aware captions, enabling dual-modal vision-language foundation models. Experiments show improved RGB-IR alignment, retrieval, and captioning, with ablation studies confirming the critical role of modality-specific textual supervision.

arxiv arXiv cs.AI · 7d ago

XGBoost-Forget for Machine Unlearning in Network Intrusion Detection

XGBoost-Forget enables efficient machine unlearning for XGBoost models on tabular network intrusion datasets. It maintains model performance while achieving faster unlearning compared to full retraining, addressing a gap in unlearning research for tabular data in network intrusion detection.

arxiv arXiv cs.AI · 7d ago

Taxonomy Links Caregiver Needs to Mental Health Tech

A new taxonomy connects Alzheimer's and dementia caregiver mental health needs with technology interventions. It identifies gaps in support for issues like relational strain and compassion fatigue, and offers a shared framework for designing person-centered, clinically grounded technologies.

arxiv arXiv cs.CL · 7d ago

LOCUS: A Local Ordinance Corpus for the United States

LOCUS provides machine-readable access to nearly all publicly available U.S. municipal and county ordinance codes, covering 9,239 cities and counties. It includes a county-harmonized access layer for 2,309 of 3,144 U.S. counties, serving the majority of the population. The corpus, built with OCR and metadata for reproducibility, enables large-scale analysis of local law, including dimensions like opacity and paternalism, using ModernBERT-based models.

arxiv arXiv cs.LG · 7d ago

Seed-Guided Semi-Supervised Clustering via A-Contrario Anomaly Detection

A new clustering framework uses a-contrario anomaly detection to define clusters as maximal subsets without anomalies under a null hypothesis of randomness. The Perception algorithm identifies outliers using an expectation-based threshold (E < 1), enabling robust, parameter-free clustering that expands from minimal seed inputs and handles noise and emerging clusters effectively.

arxiv arXiv cs.LG · 7d ago

Flow-Matching Test-Time Adaptation for OCT Image Denoising

A flow-matching-based method aligns test-time OCT images to synthetic reference trajectories, matching histogram distributions to reduce noise-induced pixel mismatches. By removing time conditioning, the model adapts to real-world noise variations, achieving state-of-the-art biomarker segmentation in Age-related Macular Degeneration stages.

arxiv arXiv cs.LG · 7d ago

LSTM-Vision Transformer Improves HRRR Forecast Error Prediction

A hybrid LSTM-Vision Transformer framework enhances prediction of HRRR forecast errors by integrating atmospheric profiles from mesonet profilers. It achieves up to twofold improvement in precipitation error prediction, especially during active planetary boundary layer periods, by better capturing convective error evolution and reducing PBL-related degradation.

arxiv arXiv cs.LG · 7d ago

Context-Aware Follow-Up Optimization for Type 2 Diabetes

A study uses a Contextual Markov Decision Process to optimize follow-up intervals for Type 2 Diabetes patients based on EHR data from 22,154 patients. The model identifies two clinical contexts—low and high risk—and recommends adaptive intervals: 1 month for unmeasured lab values, up to 3 months for elevated values or hospitalizations, and 6–12 months for stable control, with shorter intervals for high-risk patients. The CMDP policies reduced expected cumulative costs by 34.8% in high-comorbidity and 6.4% in low-comorbidity contexts compared to a fixed interval policy.

arxiv arXiv cs.CL · 7d ago

Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen LD Programs

Graph-ESBMC-PLC enables formal verification of graphical IEC 61131-3 Ladder Diagram programs by introducing a DFS-based resolver that converts graphical LD connections into valid GOTO intermediate representation. Validation on three real-world programs shows full IR generation and successful verification of safety properties at k=2 within 70 ms, with no regression on textual benchmarks.

arxiv arXiv cs.CL · 7d ago

Middle-to-Late Segments of Research Papers Reveal Key Methodological Information

This study finds that methodological information in research papers is unevenly distributed, with middle-to-late and final segments showing greater discriminative power. Combining these segments with bibliographic metadata improves the accuracy of automatic research method classification in library and information science.