Using LLM Internal Artifacts to Improve Legal Classification Reliability

This study investigates whether the internal artifacts of large language models (LLMs) can be used to detect incorrect predictions in legal classification tasks. Features extracted from these artifacts are used to train classifiers that flag erroneous outputs in two tasks: bail decision prediction and statute violation prediction. The results show that internal artifacts reliably indicate incorrect responses, which enhances the overall reliability of LLM-based legal classification systems.