A user tested models from 2B to 35B parameters on 29 difficult HTML data extraction pages and found that small models such as gemma4 e2b and e4b outperformed some larger ones. Qwen3.6 27B scored highest overall, while every MoE model scored poorly, which underlines the value of task-specific benchmarking.
r/LocalLLaMA · 7d ago · open_models
Benchmarking small LLMs on hard HTML data extraction
Importance 2/3
Alibaba (Qwen)
Google DeepMind
Mistral AI
Code generation
Evaluation & benchmarks
Reasoning models
Benchmarks
| Benchmark | Model | Score |
|---|---|---|
| HTML data extraction (29 pages) | gemma4 e4b | — |
| HTML data extraction (29 pages) | gemma4 e2b | — |
| HTML data extraction (29 pages) | Nex N2 | — |
| HTML data extraction (29 pages) | Qwen3.5 35B | — |
| HTML data extraction (29 pages) | Qwen3.6 27B | — |
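
The summary does not say how the extractions were scored. As a rough illustration of what a task-specific benchmark of this kind could look like, here is a minimal sketch that assumes each page has a hand-labeled dictionary of expected fields and each model returns a dictionary of extracted fields. The function names, the normalization rule, and the sample data are all hypothetical and are not taken from the post.

```python
from statistics import mean


def normalize(value) -> str:
    """Compare values as lowercase strings with whitespace collapsed (an assumed rule)."""
    return " ".join(str(value).split()).lower()


def page_score(expected: dict, extracted: dict) -> float:
    """Fraction of expected fields whose extracted value matches after normalization."""
    if not expected:
        return 0.0
    hits = sum(
        1
        for key, want in expected.items()
        if key in extracted and normalize(extracted[key]) == normalize(want)
    )
    return hits / len(expected)


def model_score(pages: list[tuple[dict, dict]]) -> float:
    """Mean page score over (expected, extracted) pairs for one model."""
    return mean(page_score(expected, extracted) for expected, extracted in pages)


# Hypothetical toy data for illustration only; these are not results from the post.
pages = [
    ({"title": "Blue Widget", "price": "19.99"}, {"title": "blue  widget", "price": "19.99"}),
    ({"title": "Red Widget", "price": "5.00"}, {"title": "Red Widget", "price": "5"}),
]
print(model_score(pages))  # 0.75: first page fully correct, second page misses the price
```

Exact matching after normalization is a deliberately strict choice: it penalizes a model for returning `5` instead of `5.00`. A real harness might use type-aware comparison for prices and dates, which could change the ranking between models.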