On the Effectiveness of ViT Features as Local Semantic Descriptors

Shir Amir, Yossi Gandelsman, Shai Bagon, Tali Dekel

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

We study the use of deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different object categories; and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require no additional training or data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve results competitive with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available at https://dino-vit-features.github.io/.
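
To make the phrase "clustering applied directly to the features" concrete, the following is a minimal illustrative sketch, not the authors' released implementation (which is at the URL above). It assumes that per-patch descriptors for two images are already available as arrays; here random synthetic vectors stand in for DINO-ViT features, and the function names `kmeans` and `co_cluster` are hypothetical. The idea shown is only that clustering the descriptors of both images jointly yields cluster ids that are shared across the images.

```python
import numpy as np

def kmeans(x, k, iters=20):
    """Plain Lloyd's k-means on the rows of x; returns (labels, centroids)."""
    # Farthest-point initialisation: deterministic and spreads the seeds out.
    seeds = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - c) ** 2).sum(axis=1) for c in seeds], axis=0)
        seeds.append(x[d.argmax()])
    centroids = np.stack(seeds).astype(float)
    for _ in range(iters):
        # Squared distance from every descriptor to every centroid.
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

def co_cluster(desc_a, desc_b, grid, k=2):
    """Cluster the patch descriptors of two images jointly (hypothetical helper)."""
    stacked = np.concatenate([desc_a, desc_b], axis=0)
    # L2-normalise so Euclidean k-means behaves like cosine-similarity clustering.
    stacked = stacked / np.linalg.norm(stacked, axis=1, keepdims=True)
    labels, _ = kmeans(stacked, k)
    n = len(desc_a)
    return labels[:n].reshape(grid), labels[n:].reshape(grid)

# Synthetic stand-in for ViT descriptors (an assumption, not real features):
# a 4x4 patch grid whose top half and bottom half follow two different prototypes.
rng = np.random.default_rng(0)
dim, grid = 8, (4, 4)
proto = np.eye(dim)[:2]
layout = (np.arange(16) >= 8).astype(int)
desc_a = proto[layout] + 0.01 * rng.standard_normal((16, dim))
desc_b = proto[layout] + 0.01 * rng.standard_normal((16, dim))

seg_a, seg_b = co_cluster(desc_a, desc_b, grid, k=2)
print(seg_a)
print(seg_b)
```

Because both images are clustered in one pass, a given cluster id refers to the same group of descriptors in each label map. With real features, the descriptor arrays would come from a ViT layer rather than from the synthetic prototypes used here.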

Original language: English
Title of host publication: Computer Vision – ECCV 2022 Workshops, Proceedings
Editors: Leonid Karlinsky, Tomer Michaeli, Ko Nishino
Publisher: Springer Science and Business Media B.V.
Pages: 39-55
Number of pages: 17
ISBN (Print): 9783031250682
State: Published - 2023
Event: 17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration: 23 Oct 2022 to 27 Oct 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13804 LNCS
ISSN (Print): 0302-9743

Conference

Conference: 17th European Conference on Computer Vision, ECCV 2022
Country/Territory: Israel
City: Tel Aviv
Period: 23/10/22 to 27/10/22

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science
