Training data — korshunov.ai

LLM Features Can Hurt GNNs via Concatenation Interference

Concatenating LLM-generated features to graph neural network inputs systematically reduces accuracy on homophilous benchmarks, with PubMed accuracy dropping by 17.0 ± 0.3 pp. The degradation is linked to LLM-alone discriminability (Delta_sig), which correlates with concatenation cost (r² = 0.38). The cost also follows a power-law relationship with feature dimension and node count (r² = 0.97), particularly in low-Delta_sig, low-node scenarios.

arXiv cs.AI · 8d ago

Stanford EDGAR Filings Dataset Released

Stanford introduces SEFD, an open, layout-faithful reconstruction of SEC filings into MultiMarkdown. The 152B-token SEFD-v1 dataset enables financial language modeling and includes benchmarks for forecasting and table transcription, with less than 0.1% overlap with Common Crawl.

arXiv cs.AI · 9d ago

FusionRS: First Large-Scale RGB-Infrared Remote Sensing Dataset

FusionRS introduces the first large-scale RGB-infrared-text dataset for remote sensing vision-language modeling. It aligns RGB and infrared images with IR-aware captions, enabling dual-modal vision-language foundation models. Experiments show improved RGB-IR alignment, retrieval, and captioning, with ablation studies confirming the critical role of modality-specific textual supervision.

arXiv cs.LG · 8d ago

Hybrid Ret-DNN with XGBoost for Customer Behavior Forecasting

A study proposes a hybrid model that combines Ret-DNN with XGBoost to forecast customer behavior in e-commerce. On 500,000 transaction records from a UK retailer, the model achieves a mean absolute error (MAE) of 0.2193, outperforming the existing Ret-DNN model.
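For readers unfamiliar with the metric, mean absolute error is the average of the absolute differences between predicted and actual values. The sketch below is a generic illustration only: the function name and the sample numbers are made up for this example and have no connection to the study's data or code.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only (not from the study).
example_actual = [1.0, 2.0, 3.0]
example_predicted = [1.5, 2.0, 2.0]
example_mae = mean_absolute_error(example_actual, example_predicted)
print(example_mae)
```

A lower value means predictions sit closer to the observed values on average, in the same units as the target.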

arXiv cs.LG · 8d ago

McWC: Forecasting with Cyclicity, Trend, and Channel Correlation

McWC introduces a model that separately captures cyclicity, trend, and inter-channel correlations in long-term time series forecasting. It uses multi-layer cyclicity construction, wavelet decomposition, and a multi-layer perceptron to extract and fuse high- and low-frequency information, while decoupling intra-channel autocorrelations via frequency-domain loss. Experiments on six real-world datasets show McWC achieves state-of-the-art performance with high computational efficiency.

arXiv cs.AI · 8d ago

Introducing C3GD: A Public Gunshot Audio Dataset

The Certus Caliber Classification Gunshot Dataset (C3GD) contains over 8,000 field-collected gunshot audio samples from 28 firearms across 16 calibers. It offers detailed metadata on firearms, calibers, microphones, and placement, enabling robust academic analysis and real-world applications in gunshot detection and audio signal processing.

arXiv cs.CL · 8d ago

Word2Vec's Performance in Toki Pona's Minimal Vocabulary

This study evaluates Word2Vec's ability to capture semantic relationships in Toki Pona, a language with only 130 words. Using 1.4 million sentences, it finds that non-core tokens do not disrupt embedding structure and may actually bring similar words closer in vector space. The results show Word2Vec's effectiveness relies more on distributional patterns than vocabulary size, even at extreme lexical reduction.

arXiv cs.CL · 8d ago

MultiClin Benchmark for Multiscript ASR in Clinical Settings

MultiClin introduces a clinical ASR benchmark that evaluates models' robustness to multiscript variability. It shows that multiscript-aware evaluation outperforms conventional single-reference methods, and script unification yields the best ASR performance, while inconsistent script mappings increase orthographic uncertainty.

arXiv cs.CL · 9d ago

IMPACTeen Dataset Released with English and Polish Versions

IMPACTeen is a dataset of 1,021 texts annotated from five perspectives: teenagers, parents, psychologists, communication experts, and teachers. It includes 5,100 annotation records covering social influence techniques, intentions, consequences, and resistance, with annotations validated through human editing. The dataset, created using LLM generation and human validation, is available in both Polish and English and supports research on social influence and language model training.

arXiv cs.AI · 9d ago

Textual Reviews Have Limited Impact in Recommendation Models

A study finds that while textual review signals can be fused with collaborative data, their marginal contribution remains limited compared to collaborative signals in matrix factorization models. Adaptive fusion and cross-attention mechanisms improve representation flexibility, but do not significantly boost performance across datasets.

arXiv cs.LG · 9d ago

Probabilistic Thinning Decouples Inference from State Updates

A new method decouples ML inference from state persistence in streaming systems using probabilistic thinning. It selectively triggers durable state updates based on event informativeness, reducing persistence path overhead by up to 90% without compromising downstream utility or introducing systemic errors.
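The general idea of thinning can be sketched as follows: every event is still processed, but the durable write happens only with a probability tied to how informative the event is. This is a minimal illustration under stated assumptions, not the method from the paper. The class name `ThinnedStateWriter`, the `persist` callback, and the use of an informativeness score in [0, 1] directly as the write probability are all hypothetical.

```python
import random


class ThinnedStateWriter:
    """Persist an event with probability equal to its informativeness score.

    Hypothetical sketch: the scoring rule and persistence callback are
    assumptions made for illustration, not the paper's design.
    """

    def __init__(self, persist, rng=None):
        self.persist = persist              # callable performing the durable write
        self.rng = rng or random.Random()   # injectable for reproducibility
        self.seen = 0
        self.persisted = 0

    def handle(self, event, informativeness):
        """Count the event, then write it durably with the given probability."""
        self.seen += 1
        p = min(1.0, max(0.0, informativeness))  # clamp to [0, 1]
        if self.rng.random() < p:
            self.persist(event)
            self.persisted += 1
            return True
        return False


# Usage: an in-memory list stands in for the durable store.
store = []
writer = ThinnedStateWriter(store.append, rng=random.Random(0))
for event_id in range(1000):
    writer.handle(event_id, informativeness=0.1)
print(writer.seen, writer.persisted)
```

With a constant score of 0.1, roughly one event in ten reaches the store, which is the kind of persistence-path reduction the summary describes. How informativeness is actually estimated is left open here.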

arXiv cs.LG · 9d ago

A Mathematical Review of Shape Space Analysis in Machine Learning

This survey presents a mathematical framework for analyzing geometric data, integrating differential geometry, statistics, and machine learning. It outlines a unified pipeline for shape representation, geodesic metrics, statistical analysis, and geometry-aware learning, enabling the study of shape variability and structural trajectories across populations and time. Applications span biology, medicine, anthropology, and computer vision, highlighting challenges in handling nonlinear and unaligned geometric variation.

LLM Features Can Hurt GNNs via Concatenation Interference
