PrismML's 1-bit Bonsai-8B beats IBM Granite on CPU tool calling with grammar-constrained decoding

A benchmark of PrismML's 1-bit Bonsai-8B model against IBM's Granite and other LLMs finds that Bonsai-8B achieves the highest tool-calling accuracy under grammar-constrained decoding. The test, run on CPU with llama.cpp, highlights how much small, quantized models depend on output constraints to work reliably in agent tasks.
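To make the idea of grammar-constrained decoding concrete, here is a minimal toy sketch in Python. It is not llama.cpp's implementation and does not use GBNF: a small, hypothetical set of allowed strings stands in for the grammar, and a hard-coded character ranking stands in for the model's logits. At each step the decoder simply skips any candidate the "grammar" would reject.

```python
# Hypothetical tool-call strings the "grammar" accepts.
# A real GBNF grammar describes an unbounded language; this finite set is a stand-in.
ALLOWED = [
    '{"tool":"get_weather","city":"Paris"}',
    '{"tool":"get_time","city":"Paris"}',
]


def allowed_next_chars(prefix):
    """Characters that keep `prefix` extendable to at least one allowed string."""
    return {
        s[len(prefix)]
        for s in ALLOWED
        if s.startswith(prefix) and len(s) > len(prefix)
    }


def toy_model_ranking(prefix):
    """Stand-in for model logits: candidates in descending preference.

    This toy "model" ignores the prefix and prefers chatty prose ("Sure! ")
    over JSON punctuation, mimicking a model that fails when unconstrained.
    """
    return list("Sure! ") + list('{"tool":get_weahrcimyPs,}')


def constrained_decode():
    """Greedy decoding that only ever emits grammar-legal characters."""
    out = ""
    while out not in ALLOWED:
        legal = allowed_next_chars(out)
        # Take the highest-ranked candidate that the grammar permits.
        out += next(c for c in toy_model_ranking(out) if c in legal)
    return out


if __name__ == "__main__":
    print("unconstrained first token:", toy_model_ranking("")[0])
    print("constrained output:", constrained_decode())
```

Left unconstrained, the toy model's first choice is prose; with the mask applied, every emitted character is legal, so the output is always a well-formed call. Which call gets emitted still depends on the model's own preferences, which is why constraints can repair formatting but not a wrong choice of tool.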

Bonsai-8B (Q1_0) achieved a 92% pass rate with a GBNF grammar, despite scoring 0% in raw (unconstrained) decoding.
IBM Granite-4.1-3B (Q4_K_M) led in unconstrained decoding with a 72% pass rate.
The evaluation covered 30 deterministic cases including single, parallel, sequential, and abstention tool calls.
Bonsai-8B was perfect across format, parallel, sequential, and abstention categories when grammar was active.
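As a rough illustration of what a deterministic tool-calling check can look like, the sketch below scores model output against expected calls. The benchmark's actual harness, output schema, and the way it aggregates percentages are not described here, so everything in it is an assumption: the JSON list-of-calls format, the `name` and `arguments` keys, and the helper names `parse_calls`, `passes`, and `pass_rate` are all hypothetical.

```python
import json


def parse_calls(text):
    """Parse output into a list of (name, arguments) tuples.

    Assumed (hypothetical) format: a JSON list of {"name": ..., "arguments": ...}.
    Returns None when the output is not in that format.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    calls = []
    for item in data:
        if not isinstance(item, dict) or "name" not in item:
            return None
        calls.append((item["name"], item.get("arguments", {})))
    return calls


def passes(output, expected, ordered):
    """True if the parsed calls match `expected`.

    ordered=True  -> sequential calls: order must match.
    ordered=False -> parallel calls: order is ignored.
    An empty `expected` list models abstention (no tool should be called).
    """
    calls = parse_calls(output)
    if calls is None:
        return False
    if ordered:
        return calls == expected

    def key(call):
        return json.dumps(call, sort_keys=True)

    return sorted(calls, key=key) == sorted(expected, key=key)


def pass_rate(cases):
    """Fraction of (output, expected, ordered) cases that pass."""
    return sum(passes(*case) for case in cases) / len(cases)


# Illustrative cases only; these are not taken from the benchmark.
CASES = [
    ('[{"name":"get_weather","arguments":{"city":"Paris"}}]',
     [("get_weather", {"city": "Paris"})], True),           # single call
    ('[{"name":"b","arguments":{}},{"name":"a","arguments":{}}]',
     [("a", {}), ("b", {})], False),                        # parallel, any order
    ("[]", [], True),                                       # abstention
    ("Sure! I will call get_weather.",
     [("get_weather", {"city": "Paris"})], True),           # prose: format failure
]

if __name__ == "__main__":
    print(pass_rate(CASES))
```

The last case shows how a model can fail a check like this purely on format: the intent is right, but the output cannot be parsed as a call.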

The results suggest that while 1-bit models may fail at unconstrained agent tasks, they possess the necessary semantic capability for tool calling when output is constrained by a grammar.