Vision UFormer: Long-range monocular absolute depth estimation

Tomas Polasek, Martin Čadík, Yosi Keller, Bedrich Benes

Research output: Contribution to journal › Article › peer-review


We introduce Vision UFormer (ViUT), a novel deep neural long-range monocular depth estimator. The input is an RGB image, and the output is an image that stores the absolute distances of objects in the scene as its per-pixel values. ViUT consists of a Transformer encoder and a ResNet decoder combined with UNet-style skip connections. It is trained on 1M images across ten datasets in a staged regime that starts with easier-to-predict data, such as indoor photographs, and continues to more complex long-range outdoor scenes. We show that ViUT provides comparable results for normalized relative distances and short-range classical datasets such as NYUv2 and KITTI. We further show that it successfully estimates absolute long-range depth in meters. We validate ViUT on a wide variety of long-range scenes, showing its high estimation capability with a relative improvement of up to 23%. Absolute depth estimation finds application in many areas, and we show its usability in image composition, range annotation, defocus, and scene reconstruction. Our models are available at
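The encoder-decoder layout described in the abstract (Transformer encoder, ResNet-style convolutional decoder, UNet-style skip connections, single-channel absolute-depth output) can be sketched as follows. This is a hypothetical, heavily simplified illustration: the patch size, channel widths, number of layers, and the exact placement of the skip connection are assumptions for illustration and do not reproduce the paper's model.

```python
# Minimal sketch of the ViUT idea: Transformer encoder over patch tokens,
# a simplified convolutional decoder, and a UNet-style skip connection.
# All sizes and wiring are illustrative assumptions, not the published model.
import torch
import torch.nn as nn


class ViUTSketch(nn.Module):
    def __init__(self, patch=16, dim=64, depth=2, heads=4):
        super().__init__()
        # Patch embedding: split the RGB image into patch tokens.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Two upsampling stages standing in for the ResNet-style decoder.
        self.up1 = nn.ConvTranspose2d(dim, dim, kernel_size=4, stride=4)
        self.up2 = nn.ConvTranspose2d(dim, dim, kernel_size=4, stride=4)
        # UNet-style skip: fuse features derived from the input image.
        self.skip = nn.Conv2d(3, dim, kernel_size=1)
        self.head = nn.Conv2d(dim, 1, kernel_size=1)  # per-pixel depth map

    def forward(self, x):
        b = x.shape[0]
        t = self.embed(x)                  # B x dim x H/16 x W/16
        hh, ww = t.shape[2], t.shape[3]
        t = t.flatten(2).transpose(1, 2)   # B x N x dim (token sequence)
        t = self.encoder(t)
        t = t.transpose(1, 2).reshape(b, -1, hh, ww)
        t = self.up2(self.up1(t))          # back to input resolution
        t = t + self.skip(x)               # skip connection from the input
        return self.head(t)                # B x 1 x H x W depth values


model = ViUTSketch()
out = model(torch.randn(1, 3, 64, 64))
print(tuple(out.shape))  # (1, 1, 64, 64)
```

The output has the same spatial resolution as the input, with one channel holding the predicted depth value per pixel, matching the per-pixel absolute-distance output the abstract describes.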

Original language: English
Pages (from-to): 180-189
Number of pages: 10
Journal: Computers and Graphics (Pergamon)
State: Published - Apr 2023


Keywords

  • Absolute depth prediction
  • Long-range
  • Monocular
  • Transformer

All Science Journal Classification (ASJC) codes

  • Software
  • General Engineering
  • Signal Processing
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Graphics and Computer-Aided Design


