This work introduces a framework that certifies the robustness of vision-language models under semantic-level transformations, using text prompts as proxies. It quantifies the intervals of transformation extent over which predictions remain unchanged, without requiring additional data for each variation. Experiments on synthetic and real-world data demonstrate its effectiveness across diverse semantic variations.