This article studies entity resolution over large datasets with an oracle that can cluster records only in batches of limited size. The goal is a pay-as-you-go approach that keeps oracle costs under control while maximizing recall.

  • The problem is formally cast as batched entity resolution, and selecting optimal batches is proven to be NP-hard.
  • An optimal solution is provided under the natural condition of known entity sizes.
  • The proposed approach is evaluated on six datasets, where it outperforms state-of-the-art baselines.
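
To make the setting concrete, the following is a minimal sketch of the batched oracle model, not the article's algorithm. It assumes a hypothetical oracle that perfectly groups the records of one batch by their hidden entity, merges the answers across batches by transitive closure, and measures recall as the fraction of true matching pairs recovered. The names `oracle`, `UnionFind`, `recall_after`, and the toy records are all illustrative assumptions.

```python
from itertools import combinations


def oracle(batch, truth):
    """Hypothetical perfect oracle: groups one batch by hidden entity label."""
    clusters = {}
    for record in batch:
        clusters.setdefault(truth[record], []).append(record)
    return list(clusters.values())


class UnionFind:
    """Merges oracle answers across batches via transitive closure."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def recall_after(batches, truth, batch_size):
    """Fraction of true matching pairs recovered after querying all batches."""
    records = list(truth)
    uf = UnionFind(records)
    for batch in batches:
        assert len(batch) <= batch_size, "oracle accepts limited batches only"
        for cluster in oracle(batch, truth):
            for record in cluster[1:]:
                uf.union(cluster[0], record)
    true_pairs = [(a, b) for a, b in combinations(records, 2)
                  if truth[a] == truth[b]]
    if not true_pairs:
        return 1.0
    found = sum(uf.find(a) == uf.find(b) for a, b in true_pairs)
    return found / len(true_pairs)


# Toy data (assumed): six records, three hidden entities A, B, C.
truth = {"r1": "A", "r2": "A", "r3": "B", "r4": "A", "r5": "B", "r6": "C"}

# Same budget (two batches of size 3), different batch compositions.
naive = [["r1", "r3", "r6"], ["r2", "r4", "r5"]]
informed = [["r1", "r2", "r4"], ["r3", "r5", "r6"]]

print(recall_after(naive, truth, batch_size=3))     # 0.25
print(recall_after(informed, truth, batch_size=3))  # 1.0
```

With the same budget of two oracle calls, the two batch compositions recover one and four of the four true pairs respectively, which illustrates why batch selection drives recall per unit of cost.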