TY - GEN
T1 - Sketch-Guided Text-to-Image Diffusion Models
AU - Voynov, Andrey
AU - Aberman, Kfir
AU - Cohen-Or, Daniel
N1 - Publisher Copyright: © 2023 Owner/Author.
PY - 2023/7/23
Y1 - 2023/7/23
AB - Text-to-image models have introduced a remarkable leap in the evolution of machine learning, demonstrating high-quality synthesis of images from a given text prompt. However, these powerful pretrained models still lack control handles that can guide the spatial properties of the synthesized images. In this work, we introduce a universal approach for guiding a pretrained text-to-image diffusion model with a spatial map from another domain (e.g., a sketch) at inference time. Unlike previous works, our method does not require training a dedicated model or a specialized encoder for the task. Our key idea is to train a Latent Guidance Predictor (LGP), a small per-pixel Multi-Layer Perceptron (MLP) that maps latent features of noisy images to spatial maps, where the deep features are extracted from the core Denoising Diffusion Probabilistic Model (DDPM) network. The LGP is trained on only a few thousand images and constitutes a differentiable guiding-map predictor, over which the loss is computed and propagated back to push the intermediate images to agree with the spatial map. The per-pixel training offers flexibility and locality, which allow the technique to perform well on out-of-domain sketches, including free-hand-style drawings. We focus in particular on the sketch-to-image translation task, revealing a robust and expressive way to generate images that follow the guidance of a sketch of arbitrary style or domain.
KW - diffusion models
KW - image translation
UR - http://www.scopus.com/inward/record.url?scp=85167996701&partnerID=8YFLogxK
DO - 10.1145/3588432.3591560
M3 - Conference contribution
T3 - Proceedings - SIGGRAPH 2023 Conference Papers
BT - Proceedings - SIGGRAPH 2023 Conference Papers
A2 - Spencer, Stephen N.
T2 - 2023 Special Interest Group on Computer Graphics and Interactive Techniques Conference, SIGGRAPH 2023
Y2 - 6 August 2023 through 10 August 2023
ER -
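
Below is a minimal, hypothetical PyTorch sketch of the Latent Guidance Predictor (LGP) and the inference-time guidance step described in the abstract. All names, feature dimensions, layer sizes, and the feature-extraction hook are illustrative assumptions for clarity, not the authors' released code.

# Hypothetical sketch of a per-pixel Latent Guidance Predictor (LGP) and one
# inference-time guidance step. Dimensions and the `extract_features` hook
# are assumptions; only the overall idea follows the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentGuidancePredictor(nn.Module):
    """Small per-pixel MLP: maps concatenated denoising-network features at
    each pixel to a single-channel spatial value (e.g., an edge/sketch map)."""
    def __init__(self, feature_dim: int = 512, hidden_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W); each pixel's feature vector is mapped independently.
        b, c, h, w = features.shape
        per_pixel = features.permute(0, 2, 3, 1).reshape(b * h * w, c)
        out = self.mlp(per_pixel)                              # (B*H*W, 1)
        return out.reshape(b, h, w, 1).permute(0, 3, 1, 2)     # (B, 1, H, W)

def guidance_step(noisy_latent, extract_features, lgp, target_sketch, scale=1.0):
    """One guidance step at inference: compute the loss between the LGP's
    predicted map and the target sketch, then backpropagate through the LGP
    to nudge the intermediate image toward agreement with the sketch.
    `extract_features` is an assumed hook returning stacked intermediate
    activations of the diffusion network for `noisy_latent`."""
    noisy_latent = noisy_latent.detach().requires_grad_(True)
    features = extract_features(noisy_latent)           # (B, C, H, W)
    predicted_map = lgp(features)                        # (B, 1, H, W)
    loss = F.mse_loss(predicted_map, target_sketch)      # differentiable guidance loss
    grad, = torch.autograd.grad(loss, noisy_latent)
    return noisy_latent - scale * grad                    # push latent toward the sketch

Because the MLP sees each spatial location independently, the predictor is local and lightweight, which is what the abstract credits for the method's robustness to out-of-domain and free-hand sketches.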