LordNeel releases GGUF quants of InternScience's 35B Agents-A1 with NVFP4 and MTP speculative decoding

LordNeel has published GGUF quantizations of InternScience's Agents-A1, a 35B Mixture of Experts agent model based on Qwen3.5-MoE. The release includes an NVFP4 format optimized for Blackwell GPUs and integrates multi-token prediction (MTP) speculative decoding to improve inference speed.

The model has ~3B active parameters, 256 experts, and a 256K context window, and is designed for long-horizon search and tool calling.
Quality was measured using KL-divergence over top-64 next-token distributions on 32 prompts, comparing various quant levels against BF16.
The NVFP4 files require a Blackwell GPU and an FP4-capable build, while other formats such as IQ4_XS and Q5_K_M offer either a compact footprint or near-BF16 fidelity.
MTP speculative decoding was grafted from a separate sidecar checkpoint, yielding up to a 1.22× throughput increase in single-user serving.
Draft acceptance reached 91.5% for Q4_K_M-MTP with n_max=1. The release is text-only and does not include vision support.
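
To make the quality metric concrete, here is a minimal sketch of a top-k KL-divergence measurement against a BF16 reference. It is an illustration under stated assumptions, not the release's actual evaluation script: the function names (`topk_kl`, `mean_kl`) are hypothetical, the top-k set is assumed to be chosen from the reference model's logits, and both distributions are assumed to be renormalized over that set before comparison.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def topk_kl(ref_logits, quant_logits, k=64):
    """KL(ref || quant) restricted to the reference model's top-k tokens.

    Assumption: the top-k set comes from the reference (BF16) logits and
    both distributions are renormalized over that set.
    """
    if len(ref_logits) != len(quant_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    k = min(k, len(ref_logits))
    # Indices of the k most likely tokens under the reference model.
    top = sorted(range(len(ref_logits)),
                 key=lambda i: ref_logits[i], reverse=True)[:k]
    # Softmax over the selected logits equals the renormalized distribution.
    ref_lp = log_softmax([ref_logits[i] for i in top])
    quant_lp = log_softmax([quant_logits[i] for i in top])
    return sum(math.exp(p) * (p - q) for p, q in zip(ref_lp, quant_lp))


def mean_kl(ref_positions, quant_positions, k=64):
    """Average the per-position KL over all scored token positions."""
    kls = [topk_kl(r, q, k) for r, q in zip(ref_positions, quant_positions)]
    return sum(kls) / len(kls)
```

In this sketch, each element of `ref_positions` and `quant_positions` would be the full-vocabulary logit vector at one token position, collected by running the same prompts through the BF16 model and through a quantized model. A lower mean KL indicates that the quantized model's next-token distribution stays closer to the reference.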

The release gives users several ways to run Agents-A1 locally, trading off file size, fidelity to BF16, and decoding speed across the quantization formats and MTP speculative decoding.
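
As a rough guide to reading draft acceptance figures, the sketch below uses the standard idealized model of speculative decoding, in which each draft token is accepted independently with the same probability. The function names and the `overhead` parameter are hypothetical, and the model ignores real-world effects such as memory bandwidth and batching, so it should not be expected to reproduce the measured throughput numbers.

```python
def expected_tokens_per_pass(accept_rate, n_draft):
    """Expected tokens emitted per verification pass of the main model.

    Idealized assumption: each of the n_draft draft tokens is accepted
    independently with probability accept_rate, and decoding stops at the
    first rejection. The main model always contributes one token itself,
    so the expectation is sum_{i=0}^{n_draft} accept_rate**i.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    return sum(accept_rate ** i for i in range(n_draft + 1))


def idealized_speedup(accept_rate, n_draft, overhead):
    """Speedup over plain decoding under the same idealized model.

    overhead is the extra cost of drafting and verifying, expressed as a
    fraction of one plain decoding step (a hypothetical parameter).
    """
    if overhead < 0.0:
        raise ValueError("overhead must be non-negative")
    return expected_tokens_per_pass(accept_rate, n_draft) / (1.0 + overhead)
```

With a single draft token per step, the expected output per pass is simply one plus the acceptance rate, so a high acceptance rate means nearly two tokens per pass. The realized speedup is lower because drafting and verification add cost, which the `overhead` term stands in for.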