FoCo introduces proxy tasks for zero-shot composed image retrieval

Researchers propose FoCo, a method for Zero-Shot Composed Image Retrieval (ZS-CIR) that models composition as two coordinated stages: focusing on modification-relevant visual content, then completing the target semantics. The approach uses text-anchored visual aggregation and context-conditioned semantic completion to address a limitation of existing proxy tasks, which leave the composition function itself unlearned.

FoCo employs text-anchored visual aggregation to selectively gather visual content guided by localized textual semantics.
It uses context-conditioned semantic completion to transform aggregated visuals with remaining scene context into a coherent composed representation.
The tasks are trained jointly with a cross-instance contrastive objective to encourage semantic diversity and discourage shortcut composition strategies.
The authors report state-of-the-art performance and improved generalization in experiments on four ZS-CIR benchmarks.
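
The points above can be made concrete with a small sketch. This summary does not give FoCo's architecture or loss formula, so the code below is a generic illustration and not the authors' implementation: it assumes text-anchored aggregation can be approximated by attention pooling (a text embedding acts as the query over image patch features) and that the cross-instance contrastive objective resembles a standard InfoNCE loss over a batch. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def text_anchored_aggregate(text_anchor, patches):
    """Pool patch features with attention weights driven by a text embedding.

    Assumption: scaled dot-product attention stands in for whatever
    aggregation FoCo actually uses.
    """
    scale = math.sqrt(len(text_anchor))
    weights = softmax([dot(text_anchor, p) / scale for p in patches])
    dim = len(patches[0])
    pooled = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
    return pooled, weights


def info_nce(queries, targets, temperature=0.07):
    """Batch contrastive loss: query i should match target i, not the others.

    Assumption: a generic InfoNCE form; the temperature is illustrative.
    """
    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    loss = 0.0
    for i, qi in enumerate(q):
        probs = softmax([dot(qi, tj) / temperature for tj in t])
        loss -= math.log(probs[i])
    return loss / len(q)


# Toy data: one text anchor and three 2-D patch features.
anchor = [1.0, 0.0]
patches = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
pooled, weights = text_anchored_aggregate(anchor, patches)

# Matched pairs should give a lower loss than mismatched pairs.
composed = [[1.0, 0.0], [0.0, 1.0]]
matched_loss = info_nce(composed, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = info_nce(composed, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the first patch is the one most aligned with the text anchor, so it receives the largest attention weight, and the contrastive loss is lower when each composed query is paired with its own target than when the targets are swapped.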

The authors consider learning the composition function important because it lets the model express diverse, fine-grained semantic modifications, overcoming the constraints of the predefined composition mechanisms used in prior work.