TY - GEN
T1 - Robot Instance Segmentation with Few Annotations for Grasping
AU - Kimhi, Moshe
AU - Vainshtein, David
AU - Baskin, Chaim
AU - Di Castro, Dotan
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025/1/1
Y1 - 2025/1/1
N2 - The ability of robots to manipulate objects relies heavily on their aptitude for visual perception. In domains characterized by cluttered scenes and high object variability, such as traffic, navigation, and object grasping, most methods call for vast labeled datasets, laboriously hand-annotated, with the aim of training capable models. Once deployed, the challenge of generalizing to unfamiliar objects implies that the model must evolve alongside its domain. To address this, we propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI), allowing a model to learn by observing scene alterations and to leverage visual consistency despite temporal gaps, without requiring curated data of interaction sequences. As a result, our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance. Notably, on ARMBench, we attain an AP50 of 86.37, almost a 20% improvement over existing work, and obtain remarkable results in scenarios with extremely few annotations, achieving an AP50 score of 84.89 with just 1% of annotated data, compared to the previous state of the art of 82, which used the fully annotated dataset.
AB - The ability of robots to manipulate objects relies heavily on their aptitude for visual perception. In domains characterized by cluttered scenes and high object variability, such as traffic, navigation, and object grasping, most methods call for vast labeled datasets, laboriously hand-annotated, with the aim of training capable models. Once deployed, the challenge of generalizing to unfamiliar objects implies that the model must evolve alongside its domain. To address this, we propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI), allowing a model to learn by observing scene alterations and to leverage visual consistency despite temporal gaps, without requiring curated data of interaction sequences. As a result, our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance. Notably, on ARMBench, we attain an AP50 of 86.37, almost a 20% improvement over existing work, and obtain remarkable results in scenarios with extremely few annotations, achieving an AP50 score of 84.89 with just 1% of annotated data, compared to the previous state of the art of 82, which used the fully annotated dataset.
KW - computer vision
KW - efficient learning
KW - semi-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=105003633674&partnerID=8YFLogxK
U2 - 10.1109/WACV61041.2025.00771
DO - 10.1109/WACV61041.2025.00771
M3 - Conference contribution
T3 - Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
SP - 7939
EP - 7949
BT - Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
T2 - 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025
Y2 - 28 February 2025 through 4 March 2025
ER -