Vision UFormer: Long-range monocular absolute depth estimation

Tomas Polasek, Martin Čadík, Yosi Keller, Bedrich Benes

Research output: Contribution to journal › Article › peer-review


We introduce Vision UFormer (ViUT), a novel deep neural long-range monocular depth estimator. The input is an RGB image, and the output is an image that stores the absolute distances of objects in the scene as its per-pixel values. ViUT consists of a Transformer encoder and a ResNet decoder combined with UNet-style skip connections. It is trained on 1M images across ten datasets in a staged regime that starts with easier-to-predict data, such as indoor photographs, and continues to more complex long-range outdoor scenes. We show that ViUT provides comparable results for normalized relative distances and short-range classical datasets such as NYUv2 and KITTI. We further show that it successfully estimates absolute long-range depth in meters. We validate ViUT on a wide variety of long-range scenes, showing its high estimation capability with a relative improvement of up to 23%. Absolute depth estimation finds application in many areas, and we show its usability in image composition, range annotation, defocus, and scene reconstruction. Our models are available at
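The encoder-decoder layout described in the abstract (Transformer encoder, ResNet-style convolutional decoder, UNet-style skip connections, single-channel absolute-depth output) can be sketched as follows. This is a hypothetical, heavily simplified illustration: the patch size, channel widths, number of layers, and the exact placement of the skip connection are assumptions for illustration and do not reproduce the paper's model.

```python
# Minimal sketch of the ViUT idea: Transformer encoder over patch tokens,
# a simplified convolutional decoder, and a UNet-style skip connection.
# All sizes and wiring are illustrative assumptions, not the published model.
import torch
import torch.nn as nn


class ViUTSketch(nn.Module):
    def __init__(self, patch=16, dim=64, depth=2, heads=4):
        super().__init__()
        # Patch embedding: split the RGB image into patch tokens.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        # Two upsampling stages standing in for the ResNet-style decoder.
        self.up1 = nn.ConvTranspose2d(dim, dim, kernel_size=4, stride=4)
        self.up2 = nn.ConvTranspose2d(dim, dim, kernel_size=4, stride=4)
        # UNet-style skip: fuse features derived from the input image.
        self.skip = nn.Conv2d(3, dim, kernel_size=1)
        self.head = nn.Conv2d(dim, 1, kernel_size=1)  # per-pixel depth map

    def forward(self, x):
        b = x.shape[0]
        t = self.embed(x)                  # B x dim x H/16 x W/16
        hh, ww = t.shape[2], t.shape[3]
        t = t.flatten(2).transpose(1, 2)   # B x N x dim (token sequence)
        t = self.encoder(t)
        t = t.transpose(1, 2).reshape(b, -1, hh, ww)
        t = self.up2(self.up1(t))          # back to input resolution
        t = t + self.skip(x)               # skip connection from the input
        return self.head(t)                # B x 1 x H x W depth values


model = ViUTSketch()
out = model(torch.randn(1, 3, 64, 64))
print(tuple(out.shape))  # (1, 1, 64, 64)
```

The output has the same spatial resolution as the input, with one channel holding the predicted depth value per pixel, matching the per-pixel absolute-distance output the abstract describes.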

Original language: English
Pages (from-to): 180-189
Number of pages: 10
Journal: Computers and Graphics (Pergamon)
State: Published - Apr 2023


Keywords

  • Absolute depth prediction
  • Long-range
  • Monocular
  • Transformer

All Science Journal Classification (ASJC) codes

  • Software
  • General Engineering
  • Signal Processing
  • Human-Computer Interaction
  • Computer Vision and Pattern Recognition
  • Computer Graphics and Computer-Aided Design


