A user has reported a critical bug in the Hugging Face text-embeddings-inference library that affects Qwen3 and Gemma3 models. When inference runs on CPU with concurrent requests, embedding accuracy degrades significantly: the Candle backend incorrectly skips attention masks for batches in which all input sequences have equal length. Embeddings generated under these conditions are therefore unreliable. The reporter submitted a pull request with a fix, which they tested thoroughly on their local machines. The bug highlights a stability risk for CPU-based embedding services that handle batched inputs.
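To see why skipping the mask can matter even when no padding is present, here is a minimal pure-Python sketch. It is not the Candle backend's code. It assumes the skipped mask encodes more than padding (here, a causal mask, as decoder-style models use), and the `attention` helper and the toy batch are hypothetical, chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, causal):
    """Single-head scaled dot-product attention over one sequence.

    q, k, v are lists of token vectors. When `causal` is True, token i
    may only attend to tokens 0..i (illustrative assumption, see lead-in).
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if causal and j > i:
                scores.append(float("-inf"))  # masked-out future position
            else:
                scores.append(sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d))
        weights = softmax(scores)
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two nested outputs."""
    return max(
        abs(x - y)
        for seq_a, seq_b in zip(a, b)
        for tok_a, tok_b in zip(seq_a, seq_b)
        for x, y in zip(tok_a, tok_b)
    )


# Toy batch: two sequences of equal length (3 tokens each), so no padding.
batch = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, -1.0], [2.0, 0.0], [-1.0, 1.0]],
]

masked = [attention(x, x, x, causal=True) for x in batch]
unmasked = [attention(x, x, x, causal=False) for x in batch]

# A non-zero difference means dropping the mask changed the result.
diff = max_abs_diff(masked, unmasked)
print(diff > 0.0)
```

In this toy case the outputs differ although every sequence has the same length, so "no padding" alone does not make a mask redundant. Whether this is the exact mechanism behind the reported accuracy degradation is an assumption; the pull request itself is the authoritative description of the fix.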
Qwen3/Gemma3 Candle Skips Attention Masks for Equal-Length Batches in CPU Mode