Abstract
Visual relationship detection aims to understand real-world interactions between object pairs by detecting visual relation triples written in the form (subject, predicate, object). Previous work has explored the use of contrastive learning to generate joint visual and language embeddings that aid the detection of both seen and unseen visual relation triples. However, these contrastive approaches often learn the mapping functions implicitly and do not fully consider the underlying structure of visual relation triples, which limits the models' use cases and their ability to generalize to unseen compositions. This ongoing work aims to construct joint visual and language embedding models that capture this hierarchical structure between objects and predicates by explicitly imposing structural loss constraints. In this short paper, we propose VLTransE, a novel embedding model that applies a translational loss in conjunction with the visual-language contrastive loss to learn transferable embedding spaces for subjects, objects, and predicates. At test time, the model ranks candidate visual relationships by aggregating a visual-language consistency score and a translational score. Preliminary results show that the contrastive model trained with the translational loss constraint captures hierarchical information that aids the prediction not only of visual predicates but also of masked-out objects, while achieving predicate prediction results comparable to those of the model trained without the translational loss.
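The sketch below illustrates the kind of scoring the abstract describes: a TransE-style translational score (subject + predicate should land near the object embedding) combined with a visual-language consistency score for ranking candidate triples. It is a minimal illustration only; the function names, embedding dimension, and the aggregation weight `alpha` are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def translational_score(subj_emb, pred_emb, obj_emb):
    """Higher when subject + predicate lies close to the object embedding (TransE-style)."""
    return -torch.norm(subj_emb + pred_emb - obj_emb, p=2, dim=-1)

def consistency_score(visual_emb, language_emb):
    """Cosine similarity between the visual and language embeddings of the same triple."""
    return F.cosine_similarity(visual_emb, language_emb, dim=-1)

def rank_triples(visual_emb, language_emb, subj_emb, pred_emb, obj_emb, alpha=0.5):
    """Aggregate both scores (weight `alpha` is assumed) and return candidates sorted best-first."""
    score = alpha * consistency_score(visual_emb, language_emb) \
            + (1 - alpha) * translational_score(subj_emb, pred_emb, obj_emb)
    return torch.argsort(score, descending=True)

# Toy usage: rank 4 candidate (subject, predicate, object) triples with random embeddings.
d, n = 64, 4
v, l = torch.randn(n, d), torch.randn(n, d)
s, p, o = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
print(rank_triples(v, l, s, p, o))
```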
| Original language | English |
|---|---|
| Number of pages | 12 |
| Journal | CEUR Workshop Proceedings |
| Volume | 3121 |
| Publication status | Published - Mar 2022 |
| Event | AAAI 2022 Spring Symposium on Machine Learning and Knowledge Engineering for Hybrid Intelligence, AAAI-MAKE 2022 - Stanford University, Palo Alto, United States. Duration: 21 Mar 2022 → 23 Mar 2022 |
Keywords
- Contrastive Learning
- Scene Graph
- Translational Embedding
- Visual Relationship Detection
- Zero-shot Learning