TY - GEN
T1 - On the Effectiveness of ViT Features as Local Semantic Descriptors
AU - Amir, Shir
AU - Gandelsman, Yossi
AU - Bagon, Shai
AU - Dekel, Tali
N1 - Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2023
Y1 - 2023
AB - We study the use of deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different, object categories; and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation, and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require no additional training or data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve competitive results with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available at https://dino-vit-features.github.io/.
UR - http://www.scopus.com/inward/record.url?scp=85151064616&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-25069-9_3
DO - 10.1007/978-3-031-25069-9_3
M3 - Conference contribution
SN - 9783031250682
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 39
EP - 55
BT - Computer Vision – ECCV 2022 Workshops, Proceedings
A2 - Karlinsky, Leonid
A2 - Michaeli, Tomer
A2 - Nishino, Ko
PB - Springer Science and Business Media B.V.
T2 - 17th European Conference on Computer Vision, ECCV 2022
Y2 - 23 October 2022 through 27 October 2022
ER -