Training data
arXiv cs.LG · 6d ago

VibrantForests framework maps forest structure at 10-meter resolution

The VibrantForests framework uses models trained on lidar samples to turn satellite data into annual, wall-to-wall maps of canopy cover, height, biomass, basal area, and quadratic mean diameter at 10-meter resolution across the contiguous U.S. It reduces overestimation in sparse forests and underestimation in dense forests, extending the range of reliable predictions beyond that of traditional passive-sensor models.

arXiv cs.CL · 6d ago

TerraMARS: Small Language Model Pipeline for Mars Terraforming Literature

TerraMARS is an end-to-end pipeline that uses a domain-adapted small language model to extract structured information from Mars science literature. It converts unstructured text into JSON and supports question answering on Mars terraforming, so its output can feed habitability modeling and digital twin applications. The pipeline fine-tunes Google Gemma 3 1B with QLoRA on Mars-specific datasets; further work is still needed to improve accuracy and factual consistency.
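As a rough illustration of the text-to-JSON step, the sketch below shows how a downstream consumer might validate a model's raw output before passing it on. The schema (`parameter`, `value`, `unit`), the function name `parse_extraction`, and the sample string are all hypothetical; the summary does not describe TerraMARS's actual output format.

```python
import json

# Hypothetical schema: these field names are an assumption for illustration,
# not the format used by TerraMARS.
REQUIRED_KEYS = {"parameter", "value", "unit"}


def parse_extraction(raw: str) -> dict:
    """Pull the first JSON object out of a model response and check its keys."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        record, _ = json.JSONDecoder().raw_decode(raw[start:])
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON in model output: {exc}") from exc
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return record


# Illustrative model output with surrounding chatter.
sample = 'Result: {"parameter": "surface pressure", "value": 610, "unit": "Pa"} (end)'
record = parse_extraction(sample)
```

Rejecting malformed or incomplete records at this boundary is one simple way to keep factual-consistency problems from propagating silently into later modeling stages.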

arXiv cs.CL · 6d ago

CzechDocs: Parallel Dataset for Minority Language Document Translation

CzechDocs is a multiway parallel dataset of formatted documents in HTML, DOCX, and PDF, covering Czech and minority languages such as Ukrainian, English, Vietnamese, and Russian. It supports evaluation of machine translation systems that preserve document formatting. A validation subset and an evaluation toolkit are publicly released, and a held-out test split will be used for a future shared task on document-level translation with formatting preservation.

arXiv cs.LG · 7d ago

SCAN: Multi-Scale Clustering for Time Series Anomaly Detection

SCAN enhances reconstruction-based time series anomaly detection by integrating multi-scale neighborhood-centered clustering. It uses cluster center representations to constrain the reconstruction of normal patterns, and it derives an anomaly confidence score from cluster membership probability, which it combines with reconstruction error. In experiments on real-world datasets, SCAN is reported to achieve state-of-the-art performance.
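The idea of combining a cluster-membership term with reconstruction error can be sketched as follows. This is not SCAN's actual formulation: the Gaussian-kernel membership, the weighted sum, and the names `nearest_membership`, `anomaly_score`, `sigma`, and `weight` are all assumptions chosen to keep the example small.

```python
import math


def nearest_membership(z, centers, sigma=1.0):
    """Membership in the nearest cluster, in (0, 1].

    Assumption: a Gaussian kernel on the squared distance to the closest
    center stands in for the paper's membership probability.
    """
    best_sq_dist = min(
        sum((a - b) ** 2 for a, b in zip(z, c)) for c in centers
    )
    return math.exp(-best_sq_dist / (2.0 * sigma ** 2))


def anomaly_score(x, x_hat, z, centers, weight=1.0):
    """Mean squared reconstruction error plus a penalty for weak membership.

    Assumption: a weighted sum is used here; the summary does not say how
    SCAN combines the two terms.
    """
    recon_error = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return recon_error + weight * (1.0 - nearest_membership(z, centers))


centers = [[0.0, 0.0], [5.0, 5.0]]  # toy cluster centers in latent space

# A window that reconstructs well and sits near a center scores low.
normal = anomaly_score([1.0, 2.0], [1.0, 2.1], [0.1, 0.0], centers)
# A window that reconstructs poorly and sits far from every center scores high.
odd = anomaly_score([1.0, 2.0], [3.0, 0.0], [10.0, -10.0], centers)
```

In this toy setup, a window can be flagged either because it reconstructs badly or because its latent representation is far from every learned cluster, which is the intuition behind using both signals.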

arXiv cs.CL · 7d ago

LOCUS: A Local Ordinance Corpus for the United States

LOCUS provides machine-readable access to nearly all publicly available U.S. municipal and county ordinance codes, covering 9,239 cities and counties. It includes a county-harmonized access layer for 2,309 of 3,144 U.S. counties, which together account for the majority of the population. Built with OCR and accompanied by metadata for reproducibility, the corpus enables large-scale analysis of local law with ModernBERT-based models, including along dimensions such as opacity and paternalism.
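For a sense of scale, the county figures quoted above work out as follows. The variable names are illustrative; only the two counts come from the summary.

```python
covered_counties = 2309  # counties in the harmonized access layer (from the summary)
total_counties = 3144    # U.S. county total used in the summary

county_share = covered_counties / total_counties
print(f"{county_share:.1%}")  # prints 73.4%
```

This is a share of counties, not of people; the summary states only that the covered counties account for a majority of the population.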