media r/LocalLLaMA · 7d ago · open_models

Benchmarking small LLMs on hard HTML data extraction

A user tested models ranging from 2B to 35B parameters on 29 difficult HTML data-extraction pages and found that small models such as gemma4 e2b and e4b outperformed larger ones. Qwen3.6 27B led overall, while every MoE model scored poorly, which underscores the importance of task-specific benchmarking.

Importance 2/3 · r/LocalLLaMA · Alibaba (Qwen) · Google DeepMind · Mistral AI · Code generation · Evaluation & benchmarks · Reasoning models

Benchmarks

Benchmark	Model	Score
SWE-bench Verified	gemma4 e4b	—
SWE-bench Verified	gemma4 e2b	—
SWE-bench Verified	Nex N2	—
SWE-bench Verified	Qwen3.5 35B	—
SWE-bench Verified	Qwen3.6 27B	—
