The Mannheim Data Integration Benchmark (MaDI-Bench) is presented as the first public benchmark for end-to-end integration of relational tables, a setting that has lacked comprehensive evaluation tools. It covers every step of the integration process: schema matching, value normalization, entity blocking, entity matching, and data fusion.

  • MaDI-Bench provides base tasks spanning several application domains that require the full pipeline from schema matching to conflict resolution.
  • The benchmark includes a generic method for deriving task variants to mitigate rapid saturation as data integration systems advance.
  • The benchmark was validated with human-engineered pipelines, a best-of-breed pipeline, and an LLM-based pipeline.
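
To make the five steps concrete, the following is a minimal sketch of such a pipeline on two toy tables. It is not taken from MaDI-Bench: the data, the hand-written schema mapping, the prefix blocking key, the Jaccard threshold, and the majority-vote fusion rule are all illustrative assumptions, and every function name is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Two toy source tables with different schemas (hypothetical data).
source_a = [
    {"id": "a1", "name": "Acme Corp.", "city": "Mannheim"},
    {"id": "a2", "name": "Globex", "city": "Berlin"},
]
source_b = [
    {"id": "b1", "company": "ACME Corp", "town": "mannheim"},
    {"id": "b2", "company": "Initech", "town": "Hamburg"},
]

# 1. Schema matching: map source attributes to the target schema.
#    Assumption: the mapping is given by hand instead of being learned.
SCHEMA_MAP = {"company": "name", "town": "city"}

def match_schema(record):
    return {SCHEMA_MAP.get(key, key): value for key, value in record.items()}

# 2. Value normalization: lowercase and strip trailing punctuation.
def normalize(record):
    out = dict(record)
    out["name"] = record["name"].lower().rstrip(".").strip()
    out["city"] = record["city"].lower().strip()
    return out

# 3. Entity blocking: only records sharing a 3-character name prefix are compared.
def block(records):
    blocks = defaultdict(list)
    for record in records:
        blocks[record["name"][:3]].append(record)
    return blocks

def jaccard(a, b):
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

# 4. Entity matching: token-level Jaccard similarity inside each block.
def match(blocks, threshold=0.5):
    pairs = []
    for candidates in blocks.values():
        for r1, r2 in combinations(candidates, 2):
            if jaccard(r1["name"], r2["name"]) >= threshold:
                pairs.append((r1["id"], r2["id"]))
    return pairs

# 5. Data fusion: cluster matched records, resolve conflicts by majority vote.
def fuse(records, pairs):
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(b)] = find(a)

    clusters = defaultdict(list)
    for record in records:
        clusters[find(record["id"])].append(record)

    fused = []
    for members in clusters.values():
        fused.append({
            "sources": [m["id"] for m in members],
            "name": Counter(m["name"] for m in members).most_common(1)[0][0],
            "city": Counter(m["city"] for m in members).most_common(1)[0][0],
        })
    return fused

def run_pipeline(tables):
    records = [normalize(match_schema(r)) for table in tables for r in table]
    pairs = match(block(records))
    return pairs, fuse(records, pairs)

pairs, fused = run_pipeline([source_a, source_b])
```

Here the two Acme records are matched and fused into one entity, while Globex and Initech pass through unchanged. Because each step feeds the next, an error early on (for example a wrong schema mapping) propagates to the fused output, which is why a benchmark covering the whole chain is useful.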

The benchmark enables the measurement of both step-wise and end-to-end performance of data integration pipelines, with all artifacts available for public download.
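
The difference between step-wise and end-to-end measurement can be illustrated with a short sketch. The metrics below (pair-level precision, recall, and F1 for entity matching, and cell-level accuracy for the fused output) are common choices assumed here for illustration; the summary does not state which metrics MaDI-Bench actually reports, and the function names and data are hypothetical.

```python
def prf1(predicted, gold):
    """Step-wise score: precision, recall, F1 over predicted match pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def fusion_accuracy(fused, gold):
    """End-to-end score: share of gold attribute values reproduced exactly.

    Both arguments map entity id -> {attribute: value}.
    """
    total = correct = 0
    for entity_id, attributes in gold.items():
        for attribute, value in attributes.items():
            total += 1
            if fused.get(entity_id, {}).get(attribute) == value:
                correct += 1
    return correct / total if total else 0.0

# Hypothetical outputs of a pipeline and the corresponding ground truth.
predicted_pairs = {("a1", "b1"), ("a2", "b2")}
gold_pairs = {("a1", "b1"), ("a3", "b3")}
precision, recall, f1 = prf1(predicted_pairs, gold_pairs)

predicted_fused = {"e1": {"name": "acme corp", "city": "berlin"}}
gold_fused = {"e1": {"name": "acme corp", "city": "mannheim"}}
accuracy = fusion_accuracy(predicted_fused, gold_fused)
```

Scoring each step against its own ground truth shows where a pipeline loses quality, while the end-to-end score shows how much of that loss reaches the final integrated table.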