Researchers propose a term-centric framework for inducing hierarchical taxonomies from diverse text sources, addressing the limitations of existing methods that rely on document-level representations. This approach maps documents into a shared representation space via automatic term extraction to enable robust cross-source alignment and construct interpretable hierarchies.

  • The method integrates domain priors with data-driven clustering to build hierarchies.
  • Experiments use a new multi-source benchmark in English and German containing over one million documents.
  • Results show improved cross-source coherence and hierarchy quality compared to text- and summary-based baselines.
  • A case study on German regional innovation analysis demonstrates practical utility for technology landscape mapping.
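
The term-centric idea described above can be illustrated with a toy sketch. It is not the researchers' implementation: the fixed `VOCAB` lookup stands in for automatic term extraction, and Jaccard similarity with average-linkage merging is an assumed choice of clustering. All names, documents, and terms here are hypothetical.

```python
from itertools import combinations

# Hypothetical vocabulary; a stand-in for automatic term extraction.
VOCAB = {"battery", "lithium", "electrode",
         "neural network", "training data", "solar cell"}

def extract_terms(text, vocab=VOCAB):
    """Return the vocabulary terms that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return {term for term in vocab if term in lowered}

def jaccard(a, b):
    """Jaccard similarity of two term sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def agglomerate(items):
    """Average-linkage agglomerative clustering over term sets.

    `items` is a list of (name, term_set) pairs. The result is a nested
    tuple of names, i.e. a binary hierarchy.
    """
    # Each cluster keeps its subtree and the term sets of its members.
    clusters = [(name, [terms]) for name, terms in items]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sims = [jaccard(a, b)
                    for a in clusters[i][1] for b in clusters[j][1]]
            score = sum(sims) / len(sims)
            if best is None or score > best[0]:
                best = (score, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Documents from different source types share one term space.
docs = [
    ("patent_1", "A lithium battery with a coated electrode."),
    ("paper_1", "We study electrode degradation in lithium cells."),
    ("news_1", "Startup trains a neural network on new training data."),
    ("paper_2", "Neural network scaling with more training data."),
]
items = [(name, extract_terms(text)) for name, text in docs]
tree = agglomerate(items)
print(tree)
```

Because every document is reduced to the same kind of term set, a patent, a paper, and a news item can be compared directly, regardless of how differently their full texts are written. In this toy input the battery-related and neural-network-related documents end up in separate branches.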

The framework scales to collections of more than one million documents, offering a more effective way to organize knowledge from heterogeneous sources for tasks such as policy analysis and innovation monitoring.