Training data
arxiv arXiv cs.CL · 23h ago

UD_Czech-PDTC: A Large and Genre-Rich Treebank in Universal Dependencies

The Prague Dependency Treebank-Consolidated (PDT-C) has been converted to Universal Dependencies, resulting in UD_Czech-PDTC. This resource is over twice the size of the original PDT and significantly more diverse in genres and domains. Despite structural and granularity differences between PDT-C and UD, the multi-layer annotations of PDT-C provide comprehensive data useful for basic UD trees and beyond.

media Hugging Face Forums · 3d ago

Seeking Indic Document Datasets for AI/OCR Training in India

QuantVectors is seeking annotated document datasets in Indic languages, including Hindi, Marathi, Gujarati, Bengali, Punjabi, Tamil, Urdu, Telugu, Odia, Kannada, Malayalam, and Assamese. The datasets must cover invoices, receipts, utility bills, payment advices, packing lists, commercial invoices, and credit notes, with approximately 400 documents per language, human-verified annotations, and 99%+ accuracy. Datasets must be commercially licensable, whether open-source or commercial. The poster asks for pointers to HuggingFace datasets, research datasets, or vendors specializing in this space.

arxiv arXiv cs.AI · 6d ago

DataMagic Turns Tabular Data into Interactive Insight Videos

DataMagic transforms raw tabular data and natural language queries into narrative data-insight videos. It uses DVSpec to ensure data fidelity by linking visual elements to data fields via semantic references, and employs a multi-agent architecture to generate and orchestrate coherent video scenes. The system supports interactive exploration and provenance-based data Q&A, enabling users to engage with data beyond static views.

arxiv arXiv cs.LG · 6d ago

Topological Data Analysis for Real-Time Process Monitoring

A new method combines topological data analysis and machine learning to monitor high-dimensional dynamic processes. It represents time-series data as manifolds, uses topological descriptors to capture structure, and employs neural ordinary differential equations to model dynamic evolution. The approach effectively detects diverse events in industrial process data and outperforms reconstruction-based and trajectory-based alternatives.

arxiv arXiv cs.AI · 6d ago

EEG Foundation Models for Burst-Suppression Detection in ICU

A study evaluates EEG foundation models for event-based burst-suppression detection in ICU settings without patient-specific calibration. REVE-base achieved the highest event-based F1-score, 0.868, and reduced burst-per-minute error by 52.1% relative to EEGNet and 36.2% relative to adaptive thresholding. In ablations, full fine-tuning outperforms other strategies, and pretrained REVE-base surpasses random initialization by 0.723 F1 points at 25% labeled data, which points to the value of pretraining when labeled data is limited.

arxiv arXiv cs.LG · 6d ago

TESSERA and AlphaEarth Embeddings Enable Fine-scale LCZ Mapping in Swiss Cities

A study across five Swiss cities compares TESSERA and AlphaEarth embeddings with traditional Sentinel data to upscale Local Climate Zone maps to 10-meter resolution using an attention-based U-Net. TESSERA consistently outperforms both Sentinel-1/2 and AlphaEarth, achieving IoU scores of 0.59–0.69 and 0.77–0.82. The results show embeddings reduce manual preprocessing and support scalable, reproducible LCZ mapping, though improved reference data is key for further accuracy gains.
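For readers unfamiliar with the metric, intersection over union (IoU) for a class is the number of pixels labeled with that class in both the predicted and reference maps, divided by the number labeled with it in either. The sketch below is a minimal illustration on toy data, not the study's evaluation code; the function name `per_class_iou` and the integer class labels are assumptions made for the example.

```python
def per_class_iou(pred, ref, num_classes):
    """Return {class: IoU} for two equal-length sequences of integer labels.

    Hypothetical helper for illustration; classes absent from both maps
    are skipped because their IoU is undefined.
    """
    ious = {}
    for c in range(num_classes):
        # Pixels labeled c in both maps (intersection) or in either (union).
        inter = sum(1 for p, r in zip(pred, ref) if p == c and r == c)
        union = sum(1 for p, r in zip(pred, ref) if p == c or r == c)
        if union:
            ious[c] = inter / union
    return ious


# Toy flattened label maps with three made-up classes (0, 1, 2).
pred = [0, 0, 1, 1, 1, 2]
ref = [0, 1, 1, 1, 1, 2]

scores = per_class_iou(pred, ref, num_classes=3)
mean_iou = sum(scores.values()) / len(scores)
print(scores, mean_iou)
```

Averaging the per-class values gives a mean IoU. Whether the ranges reported above are per-city, per-class, or aggregated differently is not stated in the summary.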

arxiv arXiv cs.LG · 6d ago

VibrantForests Framework Maps Forest Structure at 10-Meter Resolution

The VibrantForests framework uses satellite data, with models trained on lidar samples, to generate annual, wall-to-wall maps of canopy cover, height, biomass, basal area, and quadratic mean diameter at 10-meter resolution across the contiguous U.S. It improves accuracy by reducing overestimation in sparse forests and underestimation in dense forests, extending the range of reliable predictions beyond that of traditional passive-sensor models.