Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding

Tal Shaharabany, Lior Wolf

Research output: Contribution to journal › Conference article › peer-review

Abstract

A phrase grounding model receives an input image and a text phrase and outputs a suitable localization map. We present an effective way to refine a phrase grounding model by considering self-similarity maps extracted from the latent representation of the model's image encoder. Our main insights are that these maps resemble localization maps and that by combining such maps, one can obtain useful pseudo-labels for performing self-training. Our results surpass, by a large margin, the state of the art in weakly supervised phrase grounding. A similar gap in performance is obtained for a recently proposed downstream task called WWbL, in which only the image is given as input, without any text. Our code is available at https://github.com/talshaharabany/Similarity-Maps-for-Self-Training-Weakly-Supervised-Phrase-Grounding
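To make the core idea of the abstract concrete, below is a minimal PyTorch sketch of how self-similarity maps can be computed from an image encoder's spatial feature grid and combined into a binary pseudo-label. The function names, the map-selection heuristic, and the thresholding are illustrative assumptions for this sketch, not the paper's exact procedure; consult the linked repository for the authors' implementation.

```python
import torch
import torch.nn.functional as F


def self_similarity_maps(feats: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine self-similarity over spatial locations.

    feats: (C, H, W) latent feature grid from the image encoder.
    Returns a (H*W, H, W) tensor; map i gives the similarity of every
    location to location i and loosely resembles a localization map.
    """
    c, h, w = feats.shape
    flat = feats.reshape(c, h * w)           # (C, H*W)
    flat = F.normalize(flat, dim=0)          # unit-norm feature per location
    sim = flat.t() @ flat                    # (H*W, H*W) cosine similarities
    return sim.reshape(h * w, h, w)


def pseudo_label(feats: torch.Tensor, num_maps: int = 8, thresh: float = 0.5) -> torch.Tensor:
    """Toy pseudo-label: average a few similarity maps and threshold.

    The choice of which maps to combine (here: the most "peaked" ones)
    is a placeholder heuristic, not the selection rule from the paper.
    """
    maps = self_similarity_maps(feats)                          # (H*W, H, W)
    peakedness = maps.amax(dim=(1, 2)) - maps.mean(dim=(1, 2))  # crude saliency score per map
    idx = peakedness.topk(num_maps).indices
    combined = maps[idx].mean(dim=0)
    combined = (combined - combined.min()) / (combined.max() - combined.min() + 1e-8)
    return (combined > thresh).float()                          # binary pseudo-label mask
```

In a self-training loop, such pseudo-label masks would serve as supervision targets for refining the grounding model's predicted localization maps.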

Original language: English
Pages (from-to): 6925-6934
Number of pages: 10
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
DOIs
State: Published - 2023
Event: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 - Vancouver, Canada
Duration: 18 Jun 2023 – 22 Jun 2023

Keywords

  • Vision
  • language
  • reasoning

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition
