Training data — korshunov.ai

UD_Czech-PDTC: A Large and Genre-Rich Treebank in Universal Dependencies

The Prague Dependency Treebank-Consolidated (PDT-C) has been converted to Universal Dependencies, resulting in UD_Czech-PDTC. This resource is over twice the size of the original PDT and significantly more diverse in genres and domains. Despite structural and granularity differences between PDT-C and UD, the multi-layer annotations of PDT-C provide comprehensive data useful for basic UD trees and beyond.

arxiv arXiv cs.CL · 2d ago

Koshur Pixel: First Large-Scale Synthetic OCR Dataset for Kashmiri

Koshur Pixel introduces a synthetic OCR dataset with 613,078 image-text pairs generated from the KS-PRET-5M corpus using SynthOCR-Gen. It includes over 25 augmentation strategies and spans diverse fonts and textual scales, from words to full-page documents, enabling scalable training for Kashmiri OCR systems.

lab NVIDIA Technical Blog · 3d ago

Enable Real-Time AI for High-Speed Data Acquisition with DAQIRI

AlphaFold2's 2020 success relied on 170,000 protein structures from the Protein Data Bank. NVIDIA's DAQIRI enables real-time AI processing for high-speed data acquisition by analyzing data as it is generated.

media Hugging Face Forums · 3d ago

Seeking Indic Document Datasets for AI/OCR Training in India

QuantVectors is seeking annotated document datasets in Indic languages from India, including Hindi, Marathi, Gujarati, Bengali, Punjabi, Tamil, Urdu, Telugu, Odia, Kannada, Malayalam, and Assamese. The datasets must cover invoices, receipts, utility bills, payment advices, packing lists, commercial invoices, and credit notes, with approximately 400 documents per language, human-verified annotations, and 99%+ accuracy. They must be commercially licensable, whether open-source or commercial. The poster asks for pointers to Hugging Face datasets, research datasets, or vendors specializing in this space.

media r/LocalLLaMA · 5d ago

World's Biggest Chat Title Dataset Released by SupraLabs

SupraLabs has released a curated chat title dataset with 115K samples, surpassing the previous record of 10K samples. The filtered dataset is available as `SupraLabs/chat-titles-filtered-115K`, while an unfiltered version with 150K samples is also provided, along with a legacy 12K dataset.

arxiv arXiv cs.AI · 6d ago

DataMagic Turns Tabular Data into Interactive Insight Videos

DataMagic transforms raw tabular data and natural language queries into narrative data-insight videos. It uses DVSpec to ensure data fidelity by linking visual elements to data fields via semantic references, and employs a multi-agent architecture to generate and orchestrate coherent video scenes. The system supports interactive exploration and provenance-based data Q&A, enabling users to engage with data beyond static views.

arxiv arXiv cs.AI · 6d ago

Context-Aware Bayesian Model Improves IVF Success Prediction

A hierarchical Bayesian model using 55 context-aware environmental features reduces prediction error to 1.27% in IVF data, compared to 3–5% with raw sensor averages. The model achieves R² = 0.86 on held-out data and reduces error by 64% for women aged 35–39, showing transferable clinical signal across clinics.

arxiv arXiv cs.LG · 6d ago

Topological Data Analysis for Real-Time Process Monitoring

A new method combines topological data analysis and machine learning to monitor high-dimensional dynamic processes. It represents time-series data as manifolds, uses topological descriptors to capture structure, and employs neural ordinary differential equations to model dynamic evolution. The approach effectively detects diverse events in industrial process data and outperforms reconstruction-based and trajectory-based alternatives.

arxiv arXiv cs.LG · 6d ago

SSH-Net: Deep Network for Failure Time Prediction under Competing Risks

SSH-Net is a structured deep neural network designed to predict failure time distribution functions under competing risks. It uses separate sub-networks for different covariate groups, improving accuracy by aligning neural structure with data hierarchy. The model is validated through simulation studies and applied to Titan GPU failure data.

arxiv arXiv cs.LG · 6d ago

Bias Mitigation under Coverage Constraints and the Price of Fairness

A new framework addresses data bias in machine learning by incorporating coverage constraints to ensure sufficient representation of intersectional subgroups. It trades small bias errors for greater data efficiency and formulates bias mitigation as an integer linear program, characterizing the price of fairness as a function of fairness tolerance to guide data governance and legal compliance.

arxiv arXiv cs.AI · 6d ago

EEG Foundation Models for Burst-Suppression Detection in ICU

A study evaluates EEG Foundation Models for event-based burst-suppression detection in ICU settings without patient-specific calibration. REVE-base achieved the highest event-based F1-score of 0.868 and reduced burst-per-minute error by 52.1% compared to EEGNet and 36.2% compared to adaptive thresholding, demonstrating superior performance. Ablation results show full fine-tuning outperforms other strategies, and pretrained REVE-base surpasses random initialization by 0.723 F1 points at 25% labeled data, highlighting the value of pretraining for limited datasets.
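For readers unfamiliar with event-based scoring, here is a minimal Python sketch of how an event-based F1 can be computed. The summary does not state the study's matching rule, so the rule used here (a predicted event counts as a hit if it overlaps an unmatched reference event) and the names `overlaps` and `event_f1` are assumptions for illustration only.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share any time span."""
    return a[0] < b[1] and b[0] < a[1]


def event_f1(pred_events, ref_events):
    """Event-based F1 with greedy one-to-one overlap matching (assumed rule)."""
    matched_ref = set()
    tp = 0
    for p in pred_events:
        for i, r in enumerate(ref_events):
            if i not in matched_ref and overlaps(p, r):
                matched_ref.add(i)  # each reference event is matched at most once
                tp += 1
                break
    fp = len(pred_events) - tp  # predicted events with no reference match
    fn = len(ref_events) - tp   # reference events that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical events in seconds: two hits, two false alarms, one miss.
reference = [(0, 2), (5, 7), (10, 12)]
predicted = [(1, 3), (6, 8), (20, 21), (30, 31)]
print(round(event_f1(predicted, reference), 3))
```

Unlike sample-level F1, this counts whole events, so a long burst and a short burst carry the same weight.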

arxiv arXiv cs.AI · 6d ago

Residual-Space Evolutionary Optimization via Flow-based Generative Models

A model-agnostic framework combines flow-based generative editing with evolutionary algorithms to enable data editing in non-differentiable settings. It operates in residual space, using self-pollination for local refinement and cross-pollination for broad exploration, validated on MorphoMNIST and crystal data to balance target alignment, instance preservation, and diversity.

arxiv arXiv cs.AI · 6d ago

Learner-based Concept Drift Detection: Analysis and Evaluation

This study analyzes and evaluates concept drift detection algorithms across various categories using synthetic and real-world streaming datasets. It examines drift characteristics and evaluates detector performance under abrupt and gradual drift scenarios to improve understanding of drift behavior and detector applicability.
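As a concrete picture of what a learner-based detector does, the sketch below implements the Drift Detection Method (DDM), a well-known detector that monitors a classifier's running error rate. The summary does not list which detectors the study covers, so DDM is chosen here purely as an illustration, and the class layout, thresholds, and `min_samples` default are assumptions.

```python
import math


class DDM:
    """Minimal Drift Detection Method sketch driven by a learner's 0/1 errors."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0               # running error rate
        self.p_min = math.inf      # error rate at the best point seen so far
        self.s_min = math.inf      # its standard deviation

    def update(self, error):
        """Feed 1 for a misclassification, 0 otherwise; returns a state label."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()           # start fresh statistics after a drift
            return "drift"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


def first_drift_index(errors):
    """Index of the first sample flagged as drift, or None."""
    detector = DDM()
    for i, e in enumerate(errors):
        if detector.update(e) == "drift":
            return i
    return None


# Hypothetical stream: 10% errors for 200 samples, then an abrupt jump to 100%.
stream = [1 if i % 10 == 9 else 0 for i in range(200)] + [1] * 100
print(first_drift_index(stream))
```

A gradual drift would raise the error rate more slowly, so the same detector would typically pass through a longer "warning" phase before signalling.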

arxiv arXiv cs.AI · 6d ago

Novel DTL Approach for Data-Scarce Fault Diagnosis

A new deep transfer learning method leverages systems' non-linearities to generate diagnostic data under severe data scarcity. This approach uses a periodic multi-excitation procedure and a novel data visualization technique to augment limited vibration data, enabling effective fault diagnosis via pre-trained CNNs. Experimental results on a railway pantograph validate the method's effectiveness.

arxiv arXiv cs.LG · 6d ago

Self-Adaptive Scale Handling for Time Series Forecasting

A new module called Self-Adaptive Scale-handling (AS) addresses scale heterogeneity in time series forecasting. It uses Scale Calibrating and Scaling Selection to adaptively adjust scaling factors, preserving semantic discriminability and reducing inverse-scaling errors. Experiments on fund sales data show improved performance when integrated into existing forecasting models.

arxiv arXiv cs.LG · 6d ago

TESSERA and AlphaEarth Embeddings Enable Fine-scale LCZ Mapping in Swiss Cities

A study across five Swiss cities compares TESSERA and AlphaEarth embeddings with traditional Sentinel data to upscale Local Climate Zone maps to 10-meter resolution using an attention-based U-Net. TESSERA consistently outperforms both Sentinel-1/2 and AlphaEarth, achieving IoU scores of 0.59–0.69 and 0.77–0.82. The results show embeddings reduce manual preprocessing and support scalable, reproducible LCZ mapping, though improved reference data is key for further accuracy gains.
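The IoU (intersection over union) scores quoted above compare a predicted class map with a reference map. A minimal Python sketch of the metric follows, using flat lists of class labels in place of raster pixels. The function names and the toy labels are hypothetical, and the study's exact averaging scheme is not given in the summary.

```python
def iou_per_class(pred, ref, cls):
    """IoU for one class over two equal-length label sequences."""
    if len(pred) != len(ref):
        raise ValueError("pred and ref must have the same length")
    inter = sum(1 for p, r in zip(pred, ref) if p == cls and r == cls)
    union = sum(1 for p, r in zip(pred, ref) if p == cls or r == cls)
    return inter / union if union else float("nan")


def mean_iou(pred, ref):
    """Unweighted mean of per-class IoU over every class present in either map."""
    classes = sorted(set(pred) | set(ref))
    scores = [iou_per_class(pred, ref, c) for c in classes]
    return sum(scores) / len(scores)


# Toy example: six pixels, three hypothetical LCZ classes.
predicted = [1, 1, 2, 2, 3, 3]
reference = [1, 2, 2, 2, 3, 1]
print(mean_iou(predicted, reference))
```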

arxiv arXiv cs.LG · 6d ago

Comparative Study of Neural Surrogates for Battery State Prediction

A comparative study evaluates four neural architectures—MLP, ResNet, U-Net, and FNO—as autoregressive predictors of internal battery states using the Doyle-Fuller-Newman model. The U-Net achieves a mean final-step nRMSE of 3% across all state variables and provides a 5.38x speed-up over numerical solvers, demonstrating the importance of spatial inductive bias in surrogate performance.
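To make the two headline numbers concrete, the following Python sketch shows one common way to compute a normalized RMSE and a speed-up factor. The summary does not say how the study normalizes its error, so dividing by the range of the reference values is an assumption, as are the function names and sample values.

```python
import math


def nrmse(pred, ref):
    """RMSE divided by the range of the reference values (assumed normalization)."""
    if len(pred) != len(ref) or not ref:
        raise ValueError("pred and ref must be non-empty and of equal length")
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))
    span = max(ref) - min(ref)
    if span == 0:
        raise ValueError("reference values have zero range")
    return rmse / span


def speed_up(solver_seconds, surrogate_seconds):
    """How many times faster the surrogate is than the numerical solver."""
    return solver_seconds / surrogate_seconds


# Hypothetical state values at the final time step.
reference = [0.0, 1.0, 2.0, 3.0, 4.0]
predicted = [0.1, 0.9, 2.1, 2.9, 4.1]
print(f"nRMSE = {nrmse(predicted, reference):.3f}")
```

Other normalizations (by the mean or the standard deviation of the reference) are also in use, so nRMSE figures from different papers are only comparable when the convention matches.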

arxiv arXiv cs.LG · 6d ago

VibrantForests Framework Maps Forest Structure at 10-Meter Resolution

The VibrantForests framework uses satellite data trained on lidar samples to generate annual, wall-to-wall maps of canopy cover, height, biomass, basal area, and quadratic mean diameter at 10-meter resolution across the contiguous U.S. It improves accuracy by reducing overestimation in sparse forests and underestimation in dense forests, extending the range of reliable predictions beyond traditional passive-sensor models.