Contrastive Visual and Language Translational Embeddings for Visual Relationship Detection

Research output: Contribution to journal › Conference article › peer-review



Visual relationship detection aims to understand real-world interactions between object pairs by detecting visual relation triples written in the form of (subject, predicate, object). Previous work has explored the use of contrastive learning to generate joint visual and language embeddings that aid the detection of both seen and unseen visual relation triples. However, these contrastive approaches often learned the mapping functions implicitly and did not fully consider the underlying structure of visual relation triples, limiting the models' use cases and their ability to generalize to unseen compositions. This ongoing work aims to construct joint visual and language embedding models that can capture such hierarchical structure between objects and predicates by explicitly imposing structural loss constraints. In this short paper, we propose VLTransE, a novel embedding model that applies translational loss in conjunction with the visual-language contrastive loss to learn transferable embedding spaces for subjects, objects, and predicates. At test time, the model ranks potential visual relationships by aggregating the visual-language consistency score and the translational score. The preliminary results show that the contrastive model trained with the translational loss constraint can capture hierarchical information which aids the prediction of not only visual predicates but also masked-out objects, while achieving comparable predicate prediction results to the model trained without the translational loss.
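The abstract describes ranking candidate triples by aggregating a visual-language consistency score with a TransE-style translational score. A minimal sketch of that aggregation, assuming cosine similarity for the consistency term, a mean of the triple's language embeddings as its joint representation, and a weighting factor `alpha` (all of which are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_triple(vis, subj, pred, obj, alpha=0.5):
    """Aggregate score for a candidate (subject, predicate, object) triple.

    vis: visual embedding of the detected region pair.
    subj, pred, obj: language embeddings of the triple's components.
    The mean composition and the `alpha` weighting are assumptions
    made for this sketch.
    """
    # Visual-language consistency: how well the visual evidence matches
    # a simple joint representation of the language triple.
    consistency = cosine(vis, (subj + pred + obj) / 3.0)
    # Translational score (TransE-style): subject + predicate should land
    # near the object in embedding space; smaller distance is better.
    translational = -np.linalg.norm(subj + pred - obj)
    return alpha * consistency + (1.0 - alpha) * translational

def rank_predicates(vis, subj, obj, predicates):
    """Rank candidate predicate embeddings best-first for a fixed pair."""
    return sorted(predicates,
                  key=lambda p: score_triple(vis, subj, p, obj),
                  reverse=True)
```

Because the translational term is structural rather than tied to seen compositions, the same scoring function can in principle rank unseen triples or fill in a masked-out subject or object by searching over candidate embeddings.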

Original language: English
Number of pages: 12
Journal: CEUR Workshop Proceedings
Publication status: Published - Mar 2022
Event: AAAI 2022 Spring Symposium on Machine Learning and Knowledge Engineering for Hybrid Intelligence, AAAI-MAKE 2022 - Stanford University, Palo Alto, United States
Duration: 21 Mar 2022 - 23 Mar 2022


Keywords:
  • Contrastive Learning
  • Scene Graph
  • Translational Embedding
  • Visual Relationship Detection
  • Zero-shot Learning


