TY - CONF
T1 - Anchored Diffusion for Video Face Reenactment
AU - Kligvasser, Idan
AU - Cohen, Regev
AU - Leifman, George
AU - Rivlin, Ehud
AU - Elad, Michael
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Video generation has drawn significant interest recently, driving the development of large-scale models capable of producing realistic videos with coherent motion. Due to memory constraints, these models typically generate short video segments that are then combined into long videos. The merging process poses a significant challenge, as it requires ensuring smooth transitions and overall consistency. In this paper, we introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We extend Diffusion Transformers (DiTs) to incorporate temporal information, creating our sequence-DiT (sDiT) model for generating short video segments. Unlike previous works, we train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance, increasing flexibility and allowing the model to capture both short- and long-term relationships. Furthermore, during inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame, which ensures consistency regardless of temporal distance. To demonstrate our method, we focus on face reenactment, the task of transferring the motion of a driving video to a source face. Through comprehensive experiments, we show that our approach outperforms current techniques in producing longer, consistent, high-quality videos while offering editing capabilities.
KW - anchored diffusion
KW - diffusion process
KW - diffusion transformers
KW - face reenactment
KW - generative ai
KW - high-quality videos
KW - sequence-dit
KW - temporal information
KW - transformer architecture
KW - video consistency
KW - video editing
KW - video generation
KW - video synthesis
UR - http://www.scopus.com/inward/record.url?scp=105003636418&partnerID=8YFLogxK
U2 - 10.1109/WACV61041.2025.00402
DO - 10.1109/WACV61041.2025.00402
M3 - Conference contribution
T3 - Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
SP - 4087
EP - 4097
BT - Proceedings - 2025 IEEE Winter Conference on Applications of Computer Vision, WACV 2025
T2 - 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025
Y2 - 28 February 2025 through 4 March 2025
ER -