SceneScape: Text-Driven Consistent Scene Generation

Rafail Fridman, Amit Abecasis, Yoni Kasten, Tali Dekel

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


We present a method for text-driven perpetual view generation - synthesizing long-term videos of various scenes solely from an input text prompt describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of achieving 3D consistency, i.e., synthesizing videos that depict geometrically plausible scenes, we deploy online test-time training that encourages the predicted depth map of the current frame to be geometrically consistent with the synthesized scene. The depth maps are used to build a unified mesh representation of the scene, which is progressively extended as the video is generated. In contrast to previous works, which are applicable only to limited domains, our method generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles. Project page:
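The geometric-consistency step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: where the paper fine-tunes the depth network at test time, this sketch substitutes a closed-form scale-and-shift fit that aligns the newly predicted depth map to the depth rendered from the existing scene mesh over the visible pixels. All function and variable names here are hypothetical.

```python
import numpy as np

def align_depth(pred_depth, rendered_depth, visible_mask):
    """Align a newly predicted depth map to the depth rendered from the
    existing scene geometry, over pixels where that geometry is visible.

    A cheap closed-form stand-in (least-squares scale + shift) for the
    paper's test-time fine-tuning of the depth prediction model.
    """
    p = pred_depth[visible_mask]
    r = rendered_depth[visible_mask]
    # Solve min_{s,t} || s*p + t - r ||^2 for a global scale and shift.
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, r, rcond=None)
    return s * pred_depth + t

# Toy example: the "rendered" depth differs from the prediction by a
# global scale and offset, which the alignment recovers exactly.
pred = np.linspace(1.0, 5.0, 64).reshape(8, 8)
rendered = 2.0 * pred + 0.5
mask = np.ones_like(pred, dtype=bool)
aligned = align_depth(pred, rendered, mask)
```

In the full method, the aligned (or fine-tuned) depth would then be unprojected through the current camera pose and fused into the unified mesh before the next frame is synthesized; a global scale/shift is only the coarsest form of that consistency constraint.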

Original language: English
Title of host publication: Advances in Neural Information Processing Systems
Subtitle of host publication: NeurIPS 2023
Number of pages: 18
State: Published - 2023
Event: 37th Conference on Neural Information Processing Systems, NeurIPS 2023 - New Orleans, United States
Duration: 10 Dec 2023 – 16 Dec 2023

Publication series

Name: Advances in Neural Information Processing Systems
ISSN (Print): 1049-5258


Conference: 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Country/Territory: United States
City: New Orleans

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

