A learnable residual speech-to-spike encoder is trained jointly with a Recurrent Leaky Integrate-and-Fire (RLIF) network and achieves up to 94.97% accuracy on the Google Speech Commands v2 benchmark. A 35k-parameter version reaches 89.8%, outperforming prior methods while using far fewer parameters. The learned spike representations are task-aligned and improve class separability.
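As a rough illustration of what a recurrent leaky integrate-and-fire layer computes, the NumPy sketch below runs one such layer over an input sequence. It is a minimal sketch under stated assumptions, not the model described above: the function name `rlif_forward`, the decay factor `beta`, the reset-by-subtraction rule, and the weight shapes are all hypothetical choices made for illustration, and the learnable residual encoder and the training procedure are omitted.

```python
import numpy as np

def rlif_forward(inputs, w_in, w_rec, beta=0.9, threshold=1.0):
    """Run a recurrent LIF layer over a sequence (illustrative only).

    inputs: array of shape (T, n_in), one row per time step.
    w_in:   input weights of shape (n_in, n_hidden).
    w_rec:  recurrent weights of shape (n_hidden, n_hidden).
    Returns a (T, n_hidden) array of 0/1 spikes.
    """
    n_steps = inputs.shape[0]
    n_hidden = w_in.shape[1]
    v = np.zeros(n_hidden)            # membrane potentials
    s = np.zeros(n_hidden)            # spikes emitted at the previous step
    out = np.zeros((n_steps, n_hidden))
    for t in range(n_steps):
        # Input current plus feedback from the previous step's spikes.
        current = inputs[t] @ w_in + s @ w_rec
        # Leaky integration: the potential decays by beta, then accumulates.
        v = beta * v + current
        # A neuron fires when its potential reaches the threshold.
        s = (v >= threshold).astype(float)
        # Assumed soft reset: subtract the threshold from neurons that fired.
        v = v - s * threshold
        out[t] = s
    return out

# Usage: two neurons, no recurrence, no leak; neuron 0 gets a constant 0.5 input.
x = np.tile(np.array([0.5, 0.0]), (4, 1))
spikes = rlif_forward(x, np.eye(2), np.zeros((2, 2)), beta=1.0)
print(spikes[:, 0])  # neuron 0 fires on every second step
```

In this toy setting neuron 0 needs two steps of input to reach the threshold, so it fires on alternate steps, while neuron 1 receives no input and stays silent. A trained network would instead learn `w_in` and `w_rec`, typically with a differentiable approximation of the spike function.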