Labeling Training Data for Entity Matching Using Large Language Models

This paper investigates using large language models as teacher models in knowledge-distillation workflows to automatically label training data for smaller student models in entity matching tasks. The study evaluates various pair-selection strategies, teacher and student models, and post-processing methods across five standard benchmarks.

Student models trained on LLM-labeled data perform roughly on par with those trained on the benchmarks' training sets: F1 scores differ by less than two points.
Labeling the training sets of the five benchmarks with GPT-5.2 costs between US$28.31 and US$40.88, compared with an estimated 470 hours of manual labor.
The Ditto model performs inference 41.5 to 534 times faster than matching directly with an LLM.

These results indicate that current LLMs can substantially reduce or eliminate the manual effort required to label use-case-specific training data for entity matching when combined with suitable pair-selection methods.
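
The teacher-student workflow summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: `stub_teacher` is a hypothetical placeholder for the LLM call, the "student" is a toy similarity-threshold classifier rather than Ditto, and the four record pairs are invented. The sketch only shows the data flow: a teacher labels candidate pairs, a student is fit on those machine labels, and F1 is used to compare predictions with labels.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two record descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def stub_teacher(pair: Pair) -> bool:
    """Hypothetical stand-in for an LLM teacher (assumption, not a real API call)."""
    return jaccard(*pair) >= 0.5


def label_pairs(pairs: List[Pair], teacher: Callable[[Pair], bool]) -> List[Tuple[Pair, bool]]:
    """Attach a machine-generated match / non-match label to every candidate pair."""
    return [(pair, teacher(pair)) for pair in pairs]


def f1_score(gold: List[bool], pred: List[bool]) -> float:
    """F1 of the positive (match) class."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def train_student(labeled: List[Tuple[Pair, bool]]) -> float:
    """Toy student: choose the similarity threshold with the best F1 on the machine labels."""
    sims = [jaccard(*pair) for pair, _ in labeled]
    labels = [label for _, label in labeled]
    return max(sorted(set(sims)), key=lambda t: f1_score(labels, [s >= t for s in sims]))


# Invented candidate pairs, for illustration only.
candidate_pairs: List[Pair] = [
    ("apple iphone 13 128gb black", "iphone 13 apple 128gb black"),
    ("sony wh-1000xm4 headphones", "sony wh-1000xm4 wireless headphones"),
    ("dell xps 13 laptop", "hp spectre x360 laptop"),
    ("canon eos r6 camera", "nikon z6 camera body"),
]

machine_labeled = label_pairs(candidate_pairs, stub_teacher)
student_threshold = train_student(machine_labeled)
```

In a realistic setting the teacher would be replaced by a prompt to an LLM, the student by a fine-tuned matcher, and the student's F1 would be measured on a held-out, manually labeled test set rather than on the machine labels themselves.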