This study evaluates the performance of open-weight large language models running on-premises for text-to-SQL tasks using a reproducible benchmark on the BIRD development split. It compares three model families across two generations while ablating specific accuracy-enhancing techniques to determine their actual value.

  • Qwen2.5-Coder dominates CodeLlama at matched sizes, with 39.1% versus 20.9% execution accuracy at 7B parameters.
  • Llama-3.3-70B reaches a competitive 49.2% under a matched serving protocol, suggesting that model generation matters more than raw parameter count.
  • Self-correction provides a robust and statistically significant improvement across all three model families where there is room to improve.
  • Schema linking offers no statistically significant benefit: a linker with 96.5% gold-table recall performs indistinguishably from no linking.
  • Self-consistency yields poor value: it adds only 0.13 percentage points, a gain that is not statistically significant, at approximately five times the token cost.
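
To make the metric and one of the ablated techniques concrete, the following is a minimal sketch, not the authors' released code. It assumes SQLite databases (as BIRD uses), treats a prediction as correct when its result rows equal the gold query's rows as a set, and implements self-consistency as a majority vote over the execution results of sampled candidates. The function names `run_query`, `execution_match`, and `self_consistency_pick` are hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return its rows as a frozenset, or None if it fails."""
    try:
        return frozenset(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """Assumed scoring rule: the prediction counts as correct when its
    result set equals the gold result set (row order is ignored)."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    return gold is not None and pred == gold


def self_consistency_pick(conn, candidates):
    """Majority vote over execution results of sampled candidates.

    Returns the first candidate whose result set is the most common one;
    falls back to the first candidate if none of them executes.
    """
    results = [run_query(conn, sql) for sql in candidates]
    votes = Counter(r for r in results if r is not None)
    if not votes:
        return candidates[0]
    winner, _ = votes.most_common(1)[0]
    return candidates[results.index(winner)]


if __name__ == "__main__":
    # Toy in-memory database standing in for a benchmark database.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE t (id INTEGER, name TEXT);"
        "INSERT INTO t VALUES (1, 'a'), (2, 'b');"
    )
    samples = [
        "SELECT id FROM t",
        "SELECT name FROM t",
        "SELECT name FROM t ORDER BY id DESC",
    ]
    chosen = self_consistency_pick(conn, samples)
    print(chosen, execution_match(conn, chosen, "SELECT name FROM t"))
```

Each sampled candidate costs a full generation, so a vote over five samples costs roughly five times the tokens of a single prediction, which is the trade-off the self-consistency finding above weighs against its accuracy gain.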

The authors report real per-stage costs and release all code, predictions, and summaries to help organizations determine which accuracy recipes are worth their compute resources.