This research investigates the use of large language models to detect scam phone calls in Turkish, a low-resource language with little annotated data. The study introduces the first public multi-modal dataset of its kind, containing 100 aligned audio-transcript pairs of scam and benign conversations.

  • Evaluated seven LLMs across three families: Gemini 2.5 (Flash, Flash-Lite, Pro), GPT-4o, and Qwen (Max, Plus, Turbo).
  • Tested three input conditions: raw audio, automatic speech-to-text transcripts, and transcripts refined by a native speaker.
  • Found that transcript-based inputs consistently outperform direct audio processing.
  • Observed that transcripts refined by a native speaker and unrefined speech-to-text transcripts perform comparably.
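
The evaluation design above (seven models, three input conditions) can be sketched as a simple grid plus a scoring helper. This is a minimal illustration, not the study's code: the condition labels, the function names `evaluation_grid` and `binary_metrics`, and the choice of accuracy, precision, recall, and F1 are all assumptions, since the summary does not state which metrics were reported. The labels in the usage note are toy values, not results from the dataset.

```python
from itertools import product

# Model names as listed in the summary.
MODELS = [
    "Gemini 2.5 Flash", "Gemini 2.5 Flash-Lite", "Gemini 2.5 Pro",
    "GPT-4o",
    "Qwen Max", "Qwen Plus", "Qwen Turbo",
]

# Hypothetical identifiers for the three input conditions.
CONDITIONS = ["audio", "asr_transcript", "corrected_transcript"]


def evaluation_grid():
    """Return every (model, input condition) pair to be evaluated."""
    return list(product(MODELS, CONDITIONS))


def binary_metrics(y_true, y_pred, positive="scam"):
    """Score predicted labels against gold labels for one grid cell.

    The metric set is an assumption; the summary does not name the
    metrics used in the study.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, `binary_metrics(["scam", "benign"], ["scam", "scam"])` scores one toy cell of the grid. Repeating the call for each pair returned by `evaluation_grid()` gives one row of metrics per model and input condition, which is the kind of table needed to compare audio inputs against transcript inputs.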

The work highlights the urgent need for culturally and linguistically inclusive AI safety research and more robust multi-modal systems for fraud prevention.