High-Resolution Image Synthesis with Latent Diffusion Models

Rombach R., Blattmann A., Lorenz D., et al. High-Resolution Image Synthesis with Latent Diffusion Models [J]. 2021. DOI: 10.48550/arXiv.2112.10752.

Latent Diffusion Models

https://github.com/compvis/latent-diffusion


Abstract

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations.

To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity.

By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

3.2. Latent Diffusion Models

Diffusion models are probabilistic models designed to learn a data distribution $p(x)$ by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed Markov chain of length $T$. For image synthesis, the most successful models rely on a reweighted variant of the variational lower bound on $p(x)$, which mirrors denoising score matching. These models can be interpreted as an equally weighted sequence of denoising autoencoders $\epsilon_\theta(x_t, t)$, $t = 1, \dots, T$, which are trained to predict a denoised variant of their input $x_t$, where $x_t$ is a noisy version of the input $x$. The corresponding objective can be simplified to (Sec. B)

$$L_{DM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[ \big\lVert \epsilon - \epsilon_\theta(x_t, t) \big\rVert_2^2 \Big] \tag{1}$$

with $t$ uniformly sampled from $\{1, \dots, T\}$.
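
The objective in Eq. (1) can be made concrete with a short sketch: sample a timestep $t$, noise the input through the fixed forward process, and regress the network's prediction onto the noise. This is a minimal NumPy illustration, not the paper's implementation; `eps_model` stands in for the denoising network $\epsilon_\theta$, and `alphas_cumprod` is an assumed precomputed noise schedule (cumulative products $\bar{\alpha}_t$, as in DDPM-style parameterizations).

```python
import numpy as np

def dm_loss(eps_model, x0, alphas_cumprod, rng):
    """Monte-Carlo estimate of the simplified DM objective, Eq. (1).

    eps_model      -- stand-in for the denoising network eps_theta(x_t, t)
    x0             -- batch of clean inputs, shape (B, ...)
    alphas_cumprod -- cumulative noise-schedule products, shape (T,)
    rng            -- numpy random Generator
    """
    B, T = x0.shape[0], alphas_cumprod.shape[0]
    t = rng.integers(0, T, size=B)                    # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)               # eps ~ N(0, I)
    a = alphas_cumprod[t].reshape(B, *([1] * (x0.ndim - 1)))
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps    # x_t: noisy version of x0
    # || eps - eps_theta(x_t, t) ||_2^2, averaged over the batch
    return np.mean((eps - eps_model(x_t, t)) ** 2)
```

In practice $\epsilon_\theta$ is a neural network and this expectation is minimized by stochastic gradient descent over fresh samples of $x$, $t$, and $\epsilon$ at every step.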

Generative Modeling of Latent Representations. With our trained perceptual compression models consisting of E and D, we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower dimensional, computationally much more efficient space.

Unlike previous work that relied on autoregressive, attention-based transformer models in a highly compressed, discrete latent space, we can take advantage of image-specific inductive biases that our model offers. This includes the ability to build the underlying UNet primarily from 2D convolutional layers, and further focusing the objective on the perceptually most relevant bits using the reweighted bound, which now reads

$$L_{LDM} := \mathbb{E}_{E(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[ \big\lVert \epsilon - \epsilon_\theta(z_t, t) \big\rVert_2^2 \Big] \tag{2}$$

The neural backbone $\epsilon_\theta(\cdot, t)$ of our model is realized as a time-conditional UNet. Since the forward process is fixed, $z_t$ can be efficiently obtained from E during training, and samples from $p(z)$ can be decoded to image space with a single pass through D.
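
Eq. (2) is structurally identical to Eq. (1), except that the diffusion now runs on latents $z = E(x)$ rather than on pixels. A minimal NumPy sketch of one training-loss evaluation, under the assumption that `encoder` stands in for the frozen perceptual encoder E and `eps_model` for the time-conditional UNet $\epsilon_\theta$:

```python
import numpy as np

def ldm_loss(encoder, eps_model, x, alphas_cumprod, rng):
    """Sketch of the LDM objective, Eq. (2): Eq. (1) applied in latent space.

    encoder   -- stand-in for the frozen perceptual encoder E
    eps_model -- stand-in for the time-conditional UNet eps_theta(z_t, t)
    """
    z0 = encoder(x)                                   # z = E(x): low-dimensional latent
    B, T = z0.shape[0], alphas_cumprod.shape[0]
    t = rng.integers(0, T, size=B)                    # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(z0.shape)               # eps ~ N(0, I)
    a = alphas_cumprod[t].reshape(B, *([1] * (z0.ndim - 1)))
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps    # fixed forward process on latents
    return np.mean((eps - eps_model(z_t, t)) ** 2)
```

Because the forward process only needs E (a single encoder pass per sample) and sampling only needs one final decoder pass through D, all of the iterative denoising cost is paid in the much smaller latent space.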
