TW-LegalBench: Evaluating LLMs on Taiwanese Law

TW-LegalBench introduces a benchmark using Taiwan's public legal corpus to assess large language models' performance in Taiwanese law. It includes 16,000+ multiple-choice questions, 117 open-ended essay questions with scoring rubrics, and 14,000+ judgment prediction instances. Evaluation shows top models exceed lawyer passing thresholds (11%) but fall short of judge/prosecutor levels (1-2%), and struggle with precise legal article citations in sentencing predictions.