This research investigates the use of large language models to detect scam phone calls in Turkish, a low-resource language where annotated data is scarce. The study introduces the first public multi-modal dataset containing 100 aligned audio-transcript pairs of scam and benign conversations.
- Evaluated seven LLMs across three families: Gemini 2.5 (Flash, Flash-Lite, Pro), GPT-4o, and Qwen (Max, Plus, Turbo).
- Tested three input conditions: raw audio, automatic speech-to-text transcripts, and transcripts refined by a native speaker.
- Found that transcript-based inputs consistently outperform direct audio processing.
- Observed that human-corrected and uncorrected transcripts perform comparably.
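The experimental design described in the list above (seven models, three input conditions) can be pictured as a simple evaluation grid. The following is a minimal sketch, not the authors' code: it assumes every model is run under every condition, which the summary does not state, and it scores results by plain accuracy, which is also an assumption. The `classify` callable, the condition keys, and the sample layout are hypothetical placeholders for whatever model calls and data format the study actually used.

```python
from itertools import product

# Model names and input conditions as listed in the summary above.
MODELS = [
    "Gemini 2.5 Flash", "Gemini 2.5 Flash-Lite", "Gemini 2.5 Pro",
    "GPT-4o",
    "Qwen Max", "Qwen Plus", "Qwen Turbo",
]
# Hypothetical keys for: raw audio, speech-to-text output, native-speaker refinement.
CONDITIONS = ["audio", "asr_transcript", "corrected_transcript"]


def evaluation_grid():
    """Return every (model, condition) pair (assumes a full cross product)."""
    return list(product(MODELS, CONDITIONS))


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def evaluate(classify, samples):
    """Score each (model, condition) pair on the given samples.

    classify: hypothetical callable (model, condition, sample) -> "scam" or "benign"
    samples:  list of dicts with a "label" key plus one key per condition
    """
    labels = [s["label"] for s in samples]
    results = {}
    for model, condition in evaluation_grid():
        preds = [classify(model, condition, s) for s in samples]
        results[(model, condition)] = accuracy(preds, labels)
    return results
```

Under the full cross product assumed here, `evaluation_grid()` yields 21 pairs. Comparing the per-condition scores for a fixed model is the kind of comparison behind the transcript-versus-audio finding above.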
The work highlights the urgent need for culturally and linguistically inclusive AI safety research and more robust multi-modal systems for fraud prevention.