Training data — korshunov.ai

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

RegMix-D extends RegMix by using the full loss trajectories of proxy runs to select data mixtures dynamically. It outperforms RegMix and DoReMi across 13 downstream tasks while training only 128 proxy models, 25% of RegMix's compute budget.

arxiv arXiv cs.CL · 7d ago

SAMA: Unified Framework for Low-Resource Multimodal Data Augmentation

SAMA introduces a unified framework that generates high-fidelity, task-aware synthetic data by aligning semantic anchors across modalities. It uses a Collaborative Multi-Experts Multimodal Large Language Model with shared and task-specific adapters, and employs an Anchor-Preserving Diffusion mechanism for image synthesis, ensuring semantic consistency while diversifying visual contexts. Extensive experiments show SAMA outperforms state-of-the-art methods in MNER, MRE, and MEE under low-resource conditions.

arxiv arXiv cs.CL · 7d ago

Data Recipe Boosts Long-Context Reasoning in LLMs

A data-centric approach improves long-context reasoning in large language models, using eight curated datasets with 14K examples across retrieval, multi-evidence synthesis, and reasoning tasks. Paired with minimal outcome-based GRPO training, it achieves average gains of +7.2 to +6.4 points on seven benchmarks, outperforms prior RL training sets, and improves agentic performance by +4.8 and +7.0 points on GAIA and BrowseComp, respectively.

arxiv arXiv cs.CL · 7d ago

Distillation with Synthetic Data for Financial Sentiment Analysis

A framework transfers knowledge from large instruction-tuned models to compact ones using synthetic data generated via structured few-shot prompting. Clustering-based seed selection produces more representative synthetic examples than random sampling, enabling compact models to achieve strong performance with minimal human labeling. On complex, noisy financial text, the student model outperforms the teacher model, while remaining competitive on formal text.

arxiv arXiv cs.CL · 7d ago

Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen LD Programs

Graph-ESBMC-PLC enables formal verification of graphical IEC 61131-3 Ladder Diagram programs by introducing a DFS-based resolver that converts graphical LD connections into valid GOTO intermediate representation. Validation on three real-world programs shows full IR generation and successful verification of safety properties at k=2 within 70ms, with no regression on textual benchmarks.

arxiv arXiv cs.CL · 7d ago

Middle-to-Late Segments of Research Papers Reveal Key Methodological Information

This study finds that methodological information in research papers is unevenly distributed, with middle-to-late and final segments showing greater discriminative power. Combining these segments with bibliographic metadata improves the accuracy of automatic research method classification in library and information science.

arxiv arXiv cs.CL · 7d ago

Urdu Katib Handwritten Dataset Released for UHTR Research

The Urdu Katib Handwritten Dataset (UKHD) is a new benchmark dataset of offline Urdu handwritten text lines, curated from historical Katib writings in Nastalique calligraphy. CRNN-based models are evaluated on it, and the CNN-BGRU-CTC architecture shows the lowest error rates, making it a strong baseline for Urdu handwritten text recognition.

arxiv arXiv cs.AI · 7d ago

Quantum GAN Augmentation Shows No Benefit in Brain MRI

A controlled benchmark found no significant performance gain from quantum generative models in brain MRI augmentation. Synthetic samples produced by quantum and classical GANs were statistically indistinguishable, with both showing mode collapse and off-distribution samples, especially at low data fractions. The study concludes that quantum augmentation does not provide meaningful data expansion and acts more as regularization.

arxiv arXiv cs.AI · 7d ago

LSTM-Vision Transformer Improves HRRR Forecast Error Prediction

A hybrid LSTM-Vision Transformer framework enhances prediction of HRRR forecast errors by integrating atmospheric profiles from mesonet profilers. It achieves up to twofold improvement in precipitation error prediction, especially during active planetary boundary layer periods, by better capturing convective error evolution and reducing PBL-related degradation.

media Latent Space · 7d ago

Radical AI Achieves 10x Acceleration in Materials Discovery

Radical AI has accelerated materials discovery by producing and characterizing 1,200 alloys in six months—nearly 10x faster than DARPA/GE MACH's goal of 500 alloys in a year. Their self-driving labs use AI scientists to generate and test hypotheses in closed-loop systems, leading to 300 new materials with 10 exhibiting novel, state-of-the-art properties now being developed for commercial use.

arxiv arXiv cs.LG · 8d ago

Do Distilled Sets Outperform Coresets?

Large-scale experiments show that state-of-the-art dataset distillation methods are comparable to or worse than coreset selection on ImageNet and ImageNette. Coresets consistently achieve better data coverage and are more computationally efficient, highlighting their practical superiority over distilled sets.
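For readers unfamiliar with coreset selection, here is a minimal sketch of one common strategy, greedy k-center selection. The summary does not say which coreset methods were benchmarked, so the choice of algorithm, the function name `k_center_greedy`, and the toy points are all assumptions made for illustration.

```python
import math


def k_center_greedy(points, k, first=0):
    """Pick k indices so that every point lies close to some selected point."""
    selected = [first]
    # Distance from each point to its nearest selected point so far.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < k:
        # Add the point that is currently covered worst.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Hypothetical 2-D feature vectors, for illustration only.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
subset = k_center_greedy(points, k=3)  # indices of the retained examples
```

Unlike a distilled set, the result is a subset of real examples, so it needs no synthesis step beyond the distance computations.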

arxiv arXiv cs.CL · 8d ago

Encoding Al-Mawrid Dictionary with ISO LMF and TEI Lex-0

The paper details a methodology for digitizing the Al-Mawrid Arabic-English dictionary using ISO LMF and TEI Lex-0. It achieves 91% structural parsing accuracy and demonstrates 85% precision and 98% recall for synonyms, with 88% precision for morpho-semantic features, based on a sample of the letter Ayn. The study highlights TEI Lex-0 limitations in capturing Arabic semantic and morphological nuances and proposes a scalable prefix-based system for LLOD integration.

arxiv arXiv cs.LG · 8d ago

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural networks systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. A measure of LLM-alone discriminability, Delta_sig, correlates strongly with concatenation performance (r^2 = 0.38), and a rule based on Delta_sig <= 13.8 pp correctly predicts non-positive impact in 7 out of 9 datasets.
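The threshold rule quoted in the summary can be written as a one-line check. This is only a sketch: the function name, the dataset names, and the Delta_sig values are hypothetical, and how Delta_sig itself is computed is not described here.

```python
THRESHOLD_PP = 13.8  # cutoff quoted in the summary, in percentage points


def predicts_non_positive_impact(delta_sig_pp):
    """Return True when the rule expects concatenation not to help."""
    return delta_sig_pp <= THRESHOLD_PP


# Hypothetical Delta_sig values (pp), for illustration only.
hypothetical = {"dataset_a": 5.2, "dataset_b": 13.8, "dataset_c": 21.0}
flags = {name: predicts_non_positive_impact(v) for name, v in hypothetical.items()}
```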

arxiv arXiv cs.LG · 8d ago

Delta-Based Target Reformulation Improves Electricity Load Forecasting

A delta-based target reformulation enhances short-term electricity load forecasting by predicting load changes rather than absolute values. Results show over 50% MAPE reduction for hour-ahead forecasts across LSTM and Transformer models, with significant benefits for deep sequence models in day-ahead predictions.
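A minimal sketch of the idea, assuming the "delta" is the difference between consecutive load readings (the summary does not define it more precisely). The helper names and the sample values are hypothetical; a model would be trained on the delta series, and absolute forecasts would be recovered by accumulating its predicted changes onto the last observed load.

```python
def to_delta_targets(load):
    """Turn an absolute load series into step-to-step changes."""
    return [curr - prev for prev, curr in zip(load, load[1:])]


def reconstruct(last_observed, predicted_deltas):
    """Rebuild absolute forecasts by accumulating predicted changes."""
    forecasts = []
    level = last_observed
    for delta in predicted_deltas:
        level += delta
        forecasts.append(level)
    return forecasts


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical hourly load values (MW), for illustration only.
load = [100.0, 104.0, 103.0, 108.0]
deltas = to_delta_targets(load)         # training targets for the model
rebuilt = reconstruct(load[0], deltas)  # perfect deltas recover the series
```

With perfect delta predictions the reconstruction matches the original series, so any forecast error comes from the predicted changes alone.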

arxiv arXiv cs.LG · 8d ago

Hybrid Ret-DNN with XGBoost for Customer Behavior Forecasting

A study proposes a hybrid model that combines Ret-DNN with XGBoost to forecast customer behavior in e-commerce. Using 500,000 transaction records from a UK retailer, the model achieves a Mean Absolute Error of 0.2193, outperforming the existing Ret-DNN model.

arxiv arXiv cs.LG · 8d ago

McWC: Forecasting with Cyclicity, Trend, and Channel Correlation

McWC introduces a model that separately captures cyclicity, trend, and inter-channel correlations in long-term time series forecasting. It uses multi-layer cyclicity construction, wavelet decomposition, and a multi-layer perceptron to extract and fuse high- and low-frequency information, while decoupling intra-channel autocorrelations via frequency-domain loss. Experiments on six real-world datasets show McWC achieves state-of-the-art performance with high computational efficiency.

arxiv arXiv cs.AI · 8d ago

Introducing C3GD: A Public Gunshot Audio Dataset

The Certus Caliber Classification Gunshot Dataset (C3GD) contains over 8000 field-collected gunshot audio samples from 28 firearms across 16 calibers. It offers detailed metadata on firearms, calibers, microphones, and placement, enabling robust academic analysis and real-world applications in gunshot detection and audio signal processing.

arxiv arXiv cs.AI · 8d ago

Stanford EDGAR Filings Dataset Released

Stanford introduces SEFD, an open, layout-faithful reconstruction of SEC filings into MultiMarkdown. The 152B-token SEFD-v1 dataset enables financial language modeling and includes benchmarks for forecasting and table transcription, with less than 0.1% overlap with Common Crawl.