The authors introduce Cortex, a framework that transforms web-scale corpus construction from flat document filtering into structured knowledge organization using an Ontological Corpus Graph (OCG). This three-layer structure unifies quality-refined content, a hierarchical lightweight ontology, and cross-domain alignment to address the escalating data requirements of large language models.
- The OCG consists of a quality-refined content layer, an LLM-driven hierarchical lightweight ontology layer, and a cross-domain alignment layer that links related content across domains.
- The framework enables the synthesis of CortexBench, a cross-domain search-and-reasoning benchmark on which eight frontier LLMs are evaluated.
- Evaluation validates the effectiveness of quality refinement, domain organization, and cross-domain data synthesis.
- The complete codebase, a 24.14B-token refined corpus with its OCG, and CortexBench will be publicly released.
This approach addresses the lack of systematic knowledge organization in existing corpus construction pipelines by providing a structured method for managing high-quality, web-scale training data.
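
To make the three-layer structure concrete, the following is a minimal sketch of how an OCG-like container could be organized. It is not the authors' implementation: the class names (`Document`, `OntologyNode`, `OntologicalCorpusGraph`), the `quality` score, the `min_quality` threshold, and the path-based ontology are all assumptions introduced for illustration. In particular, the summary says the ontology layer is LLM-driven, whereas here the ontology path is simply passed in by the caller.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    quality: float  # hypothetical quality score in [0, 1]


@dataclass
class OntologyNode:
    name: str
    children: dict = field(default_factory=dict)  # child name -> OntologyNode
    doc_ids: list = field(default_factory=list)   # documents filed at this node


@dataclass
class OntologicalCorpusGraph:
    documents: dict = field(default_factory=dict)  # content layer: doc_id -> Document
    domains: dict = field(default_factory=dict)    # ontology layer: domain -> root node
    alignments: set = field(default_factory=set)   # alignment layer: (path, path) pairs

    def add_document(self, doc, path, min_quality=0.5):
        """File doc under an ontology path; return False if it is filtered out."""
        if doc.quality < min_quality:  # assumed stand-in for quality refinement
            return False
        self.documents[doc.doc_id] = doc
        domain, *rest = path
        node = self.domains.setdefault(domain, OntologyNode(domain))
        for name in rest:
            node = node.children.setdefault(name, OntologyNode(name))
        node.doc_ids.append(doc.doc_id)
        return True

    def align(self, path_a, path_b):
        """Record a symmetric association between two ontology paths."""
        a, b = tuple(path_a), tuple(path_b)
        self.alignments.add((a, b))
        self.alignments.add((b, a))

    def aligned_with(self, path):
        """Return the ontology paths associated with the given path."""
        key = tuple(path)
        return sorted(b for a, b in self.alignments if a == key)


# Usage with made-up documents and domain names.
ocg = OntologicalCorpusGraph()
kept = ocg.add_document(
    Document("d1", "Protein folding basics", 0.9), ["biology", "proteins"]
)
dropped = ocg.add_document(
    Document("d2", "click here to win", 0.1), ["biology", "proteins"]
)
ocg.add_document(
    Document("d3", "Graph neural networks", 0.8),
    ["computer_science", "machine_learning"],
)
ocg.align(["biology", "proteins"], ["computer_science", "machine_learning"])
```

Keeping the three layers as separate fields mirrors the description above: documents can be filtered without touching the ontology, and cross-domain links refer to ontology paths rather than to individual documents.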