DialogPII: A multilingual dataset of synthetic dialog transcripts to detect personal information

Researchers present DialogPII, a multilingual dataset of synthetic dialog transcripts designed to support the development and evaluation of automatic systems for detecting personally identifiable information. This resource addresses privacy concerns in sensitive domains by providing annotated data across 11 languages and eight interaction scenarios.

Covers 19 entity types across English, Arabic, Finnish, French, German, Hindi, Italian, Polish, Portuguese, Spanish, and Turkish.
Includes eight scenarios: emergency calls, medical anamnesis, therapy sessions, insurance communication, customer support, clinical interviews, police reports, and group therapy.
Data was generated semi-automatically using large language models, localized to specific contexts, and converted to speech via text-to-speech synthesis.
The synthesized speech was transcribed with Whisper, and the transcripts were annotated by automatic projection followed by manual correction.
The release includes baseline multilingual named entity recognition models and technical validation metrics.
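
The summary does not say how the automatic projection step works, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes that token-level BIO labels exist for the written dialog and that they are copied onto the speech-derived transcript wherever the two token sequences align. The function name `project_labels` and the label names `B-NAME`, `I-NAME`, and `B-CITY` are hypothetical and chosen for illustration.

```python
from difflib import SequenceMatcher


def project_labels(src_tokens, src_labels, asr_tokens):
    """Copy token labels from the written text onto an ASR transcript.

    Assumption: labels transfer only where normalized tokens match exactly.
    Unmatched ASR tokens keep the label "O" and would need manual review.
    """
    def normalize(tokens):
        # Lowercase and strip common punctuation so casing and punctuation
        # differences between text and transcript do not block alignment.
        return [t.lower().strip(".,!?") for t in tokens]

    matcher = SequenceMatcher(
        a=normalize(src_tokens), b=normalize(asr_tokens), autojunk=False
    )
    projected = ["O"] * len(asr_tokens)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Equal blocks have the same length on both sides.
            for offset in range(i2 - i1):
                projected[j1 + offset] = src_labels[i1 + offset]
    return projected


# Hypothetical example: the ASR output misspells the surname.
written = "My name is Anna Schmidt and I live in Berlin.".split()
labels = ["O", "O", "O", "B-NAME", "I-NAME", "O", "O", "O", "O", "B-CITY"]
transcript = "my name is anna schmitt and i live in berlin".split()

projected = project_labels(written, labels, transcript)
for token, label in zip(transcript, projected):
    print(f"{token}\t{label}")
```

In this sketch the misrecognized surname receives no label, which is the kind of gap a manual correction pass would have to close.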

The dataset provides aligned written and speech-derived resources to facilitate the creation of robust de-identification systems for protecting individual privacy in conversational data.