Abstract
We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and may only recover sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a new source of data: thousands of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera tours the scene. Because the people are stationary, geometric constraints hold, so training data can be generated using multi-view stereo reconstruction. At inference time, our method uses motion parallax cues from the static areas of the scenes to guide the depth prediction. We evaluate our method on real-world sequences of complex human actions captured by a moving hand-held camera, show improvement over state-of-the-art monocular depth prediction methods, and demonstrate various 3D effects produced using our predicted depth.
Original language | English |
---|---|
Pages (from-to) | 4229-4241 |
Number of pages | 13 |
Journal | IEEE Transactions on Pattern Analysis and Machine Intelligence |
Volume | 43 |
Issue number | 12 |
DOIs | |
State | Published - 1 Dec 2021 |
Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Software
- Computer Vision and Pattern Recognition
- Computational Theory and Mathematics
- Artificial Intelligence
- Applied Mathematics