This article introduces Semantic Browsing, a method for generating controlled diversity in text-to-image models. It enforces structure on the generated samples to address the lack of meaningful variation in current systems, and it induces diversity directly at the text level rather than relying on stochastic variation within the model.
- Exploits the decoupling of semantic decision-making from pixel generation in recent text-to-image models trained on elaborated captions.
- Leverages rich textual representations to allow a Vision Language Model (VLM) to operate on full scene context.
- Employs an agentic workflow that explicitly enforces structured variation attuned to the original prompt.
- Produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision.
The authors consider this important because it lets users navigate structured image galleries and explore creatively by systematically traversing meaningful, interpretable axes of variation.
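To make the idea of text-level variation concrete, here is a minimal sketch of how a gallery could be laid out along named semantic axes. It is not the authors' implementation. The axes are written by hand here, whereas the article describes a VLM inside an agentic workflow proposing variation attuned to the prompt. No image model is called. The names `Variation`, `build_gallery`, and the example axes are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Variation:
    """One gallery cell: the semantic decisions taken and the resulting caption."""
    decisions: tuple  # ((axis_name, chosen_value), ...)
    caption: str


def build_gallery(base_prompt, axes):
    """Enumerate every combination of axis values as a labeled caption.

    `axes` maps an axis name to its candidate values. Here they are supplied
    by hand (an assumption); the article has a VLM propose them from context.
    """
    names = list(axes)
    gallery = []
    for values in product(*(axes[name] for name in names)):
        decisions = tuple(zip(names, values))
        detail = ", ".join(f"{name}: {value}" for name, value in decisions)
        gallery.append(Variation(decisions, f"{base_prompt}, {detail}"))
    return gallery


# Hypothetical prompt and axes, chosen only for illustration.
axes = {
    "time of day": ["dawn", "night"],
    "setting": ["forest", "harbor", "rooftop"],
}
gallery = build_gallery("a lighthouse keeper reading a letter", axes)
for variation in gallery:
    print(variation.caption)
```

Each caption would then be passed to a text-to-image model. Because every cell carries its own `decisions`, a user can tell which semantic choice distinguishes one image from its neighbors, which is the navigability property described in the list above.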