Training data — korshunov.ai

Hybrid modeling predicts microbial dynamics in soil systems

A new hybrid modeling framework uses genomic data and neural networks to predict biokinetic parameters in soil organic matter turnover models. It incorporates ecological constraints to ensure realistic microbial dynamics, even for unobserved variables, and outperforms existing methods on both synthetic and real datasets with minimal training data.

arxiv arXiv cs.CL · 6d ago

TerraMARS: Small Language Model Pipeline for Mars Terraforming Literature

TerraMARS is an end-to-end pipeline that uses a domain-adapted small language model to extract structured information from Mars science literature. It converts unstructured text into JSON format and supports Mars terraforming-related question answering, enabling integration into habitability modeling and digital twin applications. The pipeline uses Google Gemma 3 1B fine-tuned with QLoRA on Mars-specific datasets, though further work is needed to improve accuracy and factual consistency.

arxiv arXiv cs.CL · 6d ago

Algorithm for Pitch Spelling and Key Estimation in Music Transcription

A new algorithm estimates note names, key signatures, and local scales from MIDI-like input by jointly optimizing modal and tonal stages. It has been evaluated on jazz lead sheets, solo transcriptions, traditional tunes, and classical piano scores, with additional distances defined between common jazz scales for musicological research.

arxiv arXiv cs.CL · 6d ago

CzechDocs: Parallel Dataset for Minority Language Document Translation

CzechDocs is a multiway parallel dataset of formatted documents in HTML, DOCX, and PDF formats, covering Czech and minority languages such as Ukrainian, English, Vietnamese, and Russian. It supports evaluation of machine translation systems that preserve document formatting, with a validation subset and evaluation toolkit publicly released. A held-out test split will be used for a future shared task on document-level translation with formatting preservation.

media r/LocalLLaMA · 7d ago

Does anyone have enough compute to make a distillation dataset from GLM5.2?

A user asks if anyone with sufficient computing resources can create a large distillation dataset of 70-1 million examples from GLM5.2. The goal is to enable better training of smaller models like Qwen3.5, benefiting the broader community.

media r/LocalLLaMA · 7d ago

LocalLLaMA proposes crowdsourced coding dataset

A community initiative suggests creating a crowdsourced coding dataset to enable local LLM development. The proposal aims to allow anyone with hardware to contribute data, with more powerful users helping to fine-tune or quantize models, thus reducing reliance on company-released models.

arxiv arXiv cs.LG · 7d ago

Automated Annotation Framework for Delayed and False AEB Triggers

A new automated system addresses extreme class imbalance and asymmetric label noise in Autonomous Emergency Braking (AEB) data. It uses targeted data augmentation and noise suppression to identify rare delayed and false triggers, improving recall by 80% and cutting manual annotation effort by 50%, which enables continuous self-improvement in on-vehicle AEB optimization.

arxiv arXiv cs.LG · 7d ago

XGBoost-Forget for Machine Unlearning in Network Intrusion Detection

XGBoost-Forget enables efficient machine unlearning for XGBoost models on tabular network intrusion datasets. It maintains model performance while unlearning faster than full retraining, addressing a gap in unlearning research for tabular data in network intrusion detection.

arxiv arXiv cs.LG · 7d ago

SCAN: Multi-Scale Clustering for Time Series Anomaly Detection

SCAN enhances reconstruction-based time series anomaly detection by integrating multi-scale neighborhood-centered clustering. It uses cluster center representations to constrain normal pattern reconstruction and derives an anomaly confidence score based on cluster membership probability, combined with reconstruction error. Extensive experiments on real-world datasets show SCAN achieves state-of-the-art performance.
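
The scoring idea in this summary, a cluster-membership confidence combined with reconstruction error, can be illustrated with a minimal sketch. Everything specific here is an assumption rather than SCAN's actual formulation: the Gaussian kernel used as a stand-in for membership probability, the weighted sum, the single scale, and the names `membership_probability`, `anomaly_score`, `bandwidth`, and `alpha`.

```python
import math


def membership_probability(point, centers, bandwidth=1.0):
    """Gaussian-kernel affinity of `point` to its nearest cluster center.

    Assumption: this kernel stands in for a cluster membership probability.
    Returns a value in (0, 1]; it equals 1 when the point sits on a center.
    """
    nearest_sq_dist = min(
        sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers
    )
    return math.exp(-nearest_sq_dist / (2.0 * bandwidth ** 2))


def anomaly_score(point, reconstruction, centers, bandwidth=1.0, alpha=0.5):
    """Weighted sum of reconstruction error and anomaly confidence.

    Assumption: anomaly confidence is 1 minus the membership probability,
    and the two terms are mixed linearly with weight `alpha`.
    """
    error = math.sqrt(sum((p - r) ** 2 for p, r in zip(point, reconstruction)))
    confidence = 1.0 - membership_probability(point, centers, bandwidth)
    return alpha * error + (1.0 - alpha) * confidence


# Two hypothetical cluster centers learned from normal data.
centers = [(0.0, 0.0), (5.0, 5.0)]

# A point near a center that is reconstructed well scores low.
normal = anomaly_score((0.1, 0.0), (0.1, 0.0), centers)

# A point between clusters that is reconstructed poorly scores high.
outlier = anomaly_score((2.5, 2.5), (1.0, 1.0), centers)
```

In this sketch a point is flagged either because the model reconstructs it badly or because it lies far from every learned cluster center, so the two signals can compensate for each other's blind spots.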

arxiv arXiv cs.LG · 7d ago

Optimizing climate scenarios boosts emulator generalization

A new method uses a differentiable simple climate model to optimize training scenarios, enhancing emulator generalization. Training on one optimized scenario outperforms six standard ScenarioMIP pathways, and such scenarios yield more skillful emulators when used with intermediate-complexity models, despite smaller dataset sizes.

arxiv arXiv cs.LG · 7d ago

Chandra-Gaia Catalog Uses Machine Learning to Resolve X-ray and Optical Source Matches

A machine learning framework resolves ambiguous matches between Chandra X-ray and Gaia optical sources by using magnitude, color, and distance data. It identifies counterparts for 113,000 of 254,000 Chandra sources, finds plausible multiple counterparts for 7,000, and reaches 95% accuracy on the COUP survey without using positional data.

arxiv arXiv cs.AI · 7d ago

Taxonomy Links Caregiver Needs to Mental Health Tech

A new taxonomy connects Alzheimer's and dementia caregiver mental health needs with technology interventions. It identifies gaps in support for issues like relational strain and compassion fatigue, and offers a shared framework for designing person-centered, clinically grounded technologies.

arxiv arXiv cs.CL · 7d ago

LOCUS: A Local Ordinance Corpus for the United States

LOCUS provides machine-readable access to nearly all publicly available U.S. municipal and county ordinance codes, covering 9,239 cities and counties. It includes a county-harmonized access layer for 2,309 of 3,144 U.S. counties, serving the majority of the population. The corpus, built with OCR and metadata for reproducibility, enables large-scale analysis of local law, including dimensions like opacity and paternalism, using ModernBERT-based models.

arxiv arXiv cs.LG · 7d ago

Seed-Guided Semi-Supervised Clustering via A-Contrario Anomaly Detection

A new clustering framework uses a-contrario anomaly detection to define clusters as maximal subsets without anomalies under a null hypothesis of randomness. The Perception algorithm identifies outliers using an expectation-based threshold ($\mathbb{E} < 1$), enabling robust, parameter-free clustering that expands from minimal seed inputs and handles noise and emerging clusters effectively.
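
The expectation-based threshold can be illustrated with a generic a-contrario test: a value is flagged when the expected number of equally extreme values under the null hypothesis falls below 1. The sketch below is not the Perception algorithm. It assumes a one-dimensional Gaussian null fitted to the data itself, and the function names `expected_false_alarms` and `find_outliers` are hypothetical.

```python
import math
import statistics


def expected_false_alarms(value, mean, std, n):
    """Expected count, among n Gaussian draws, of deviations at least as
    extreme as `value` (two-sided). Assumes a Gaussian null hypothesis."""
    z = abs(value - mean) / std
    tail_probability = math.erfc(z / math.sqrt(2.0))  # P(|Z| >= z)
    return n * tail_probability


def find_outliers(data):
    """Return the values whose expected number of false alarms is below 1."""
    n = len(data)
    mean = statistics.fmean(data)
    std = statistics.stdev(data)
    return [x for x in data if expected_false_alarms(x, mean, std, n) < 1.0]


# Eight values near 1.0 and one far-away value.
data = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98, 8.0]
outliers = find_outliers(data)
```

Because the threshold is an expected count rather than a hand-picked distance or probability, it scales with the number of points tested, which is the sense in which such a test needs no tuning parameter.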

arxiv arXiv cs.LG · 7d ago

Flow-Matching Test-Time Adaptation for OCT Image Denoising

A flow-matching-based method aligns test-time OCT images to synthetic reference trajectories, matching histogram distributions to reduce noise-induced pixel mismatches. By removing time conditioning, the model adapts to real-world noise variations, achieving state-of-the-art biomarker segmentation in Age-related Macular Degeneration stages.

arxiv arXiv cs.LG · 7d ago

LSTM-Vision Transformer Improves HRRR Forecast Error Prediction

A hybrid LSTM-Vision Transformer framework enhances prediction of HRRR forecast errors by integrating atmospheric profiles from mesonet profilers. It achieves up to twofold improvement in precipitation error prediction, especially during active planetary boundary layer periods, by better capturing convective error evolution and reducing PBL-related degradation.

arxiv arXiv cs.LG · 7d ago

Context-Aware Follow-Up Optimization for Type 2 Diabetes

A study uses a Contextual Markov Decision Process to optimize follow-up intervals for Type 2 Diabetes patients based on EHR data from 22,154 patients. The model identifies two clinical contexts—low and high risk—and recommends adaptive intervals: 1 month for unmeasured lab values, up to 3 months for elevated values or hospitalizations, and 6–12 months for stable control, with shorter intervals for high-risk patients. The CMDP policies reduced expected cumulative costs by 34.8% in high-comorbidity and 6.4% in low-comorbidity contexts compared to a fixed interval policy.

arxiv arXiv cs.CL · 7d ago

CDDTLDA: Transfer Learning for Chinese Dialect Discrimination

A novel framework named CDDTLDA uses transfer learning and data augmentation to address Chinese dialect discrimination with limited annotations. It trains a source ASR model on a large dialect corpus, applies speed, pitch, and noise augmentation to low-resource target dialects, and fine-tunes a target ASR model using self-attention to capture shared semantic features. Experimental results show that CDDTLDA outperforms state-of-the-art methods on two benchmark Chinese dialect corpora.