Semantic Browsing: Controllable Diversity for Image Generation
Modern text-to-image models often suffer from diversity collapse: their outputs are high in fidelity but vary little from one sample to the next. The authors introduce Semantic Browsing, which provides controlled diversity through structured image galleries, so that users navigate meaningful axes of variation rather than incidental noise. The approach exploits the decoupling of semantic decision-making from pixel generation in recent models: diversity is induced directly at the text level, using rich textual representations. A Vision Language Model (VLM) operates on the full scene context within an agentic workflow that explicitly enforces structured variation attuned to the original prompt. The result is a navigable design space whose semantic decisions are interpretable.
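To make the idea of text-level, axis-structured variation concrete, here is a minimal sketch. It is not the authors' implementation: the `Axis` class, the `build_gallery` function, and the example axes are all hypothetical, and the axes are hard-coded where the described method would have a VLM propose them from the full scene context. The sketch only shows how a prompt might be expanded into a grid of text variants, each tagged with the semantic choices that produced it.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Axis:
    """One interpretable axis of variation (hypothetical structure)."""
    name: str
    values: tuple[str, ...]


def build_gallery(prompt: str, axes: list[Axis]) -> list[dict]:
    """Expand a prompt into one text variant per combination of axis values.

    Each entry records the semantic choices alongside the expanded prompt,
    so every gallery cell stays interpretable.
    """
    gallery = []
    for combo in product(*(axis.values for axis in axes)):
        choices = {axis.name: value for axis, value in zip(axes, combo)}
        details = ", ".join(f"{name}: {value}" for name, value in choices.items())
        text = f"{prompt}, {details}" if details else prompt
        gallery.append({"choices": choices, "prompt": text})
    return gallery


# Assumption: these axes are hand-written stand-ins for what a VLM
# would propose after reading the original prompt.
axes = [
    Axis("time of day", ("dawn", "night")),
    Axis("camera angle", ("close-up", "wide shot", "aerial")),
]
gallery = build_gallery("a lighthouse on a cliff", axes)

for cell in gallery:
    print(cell["prompt"])
```

Each expanded prompt would then be passed to an image generator of choice. Because the variation is recorded as named choices rather than random seeds, a user can browse the gallery along one axis while holding the others fixed.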