High-Resolution Image Synthesis with Latent Diffusion Models
Rombach R., Blattmann A., Lorenz D., et al. High-Resolution Image Synthesis with Latent Diffusion Models [J]. 2021. DOI: 10.48550/arXiv.2112.10752.
Abstract
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations.
To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity.
By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
3.2. Latent Diffusion Models
Diffusion models are probabilistic models designed to learn a data distribution $p(x)$ by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed Markov chain of length $T$. These models can be interpreted as an equally weighted sequence of denoising autoencoders $\epsilon_\theta(x_t, t)$, $t = 1, \dots, T$, trained to predict a denoised variant of their input $x_t$, where $x_t$ is a noisy version of the input $x$. The corresponding objective simplifies to

$$L_{DM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\, \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2 \,\right],$$

with $t$ sampled uniformly from $\{1, \dots, T\}$.
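The simplified DM training objective, predicting the noise $\epsilon$ that was added to form $x_t$, can be sketched as a single Monte Carlo sample. The linear $\beta$ schedule, shapes, and the zero-noise toy model below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def dm_loss(x, eps_model, T=1000, rng=None):
    # One Monte Carlo sample of the simplified objective L_DM:
    # t ~ Uniform{1..T}, eps ~ N(0, I), then ||eps - eps_theta(x_t, t)||^2.
    rng = rng if rng is not None else np.random.default_rng(0)
    t = int(rng.integers(1, T + 1))
    eps = rng.standard_normal(x.shape)
    # Closed-form forward process: x_t = sqrt(abar_t)*x + sqrt(1-abar_t)*eps,
    # with an assumed linear beta schedule (not fixed by the text above).
    betas = np.linspace(1e-4, 2e-2, T)
    abar_t = np.prod(1.0 - betas[:t])
    x_t = np.sqrt(abar_t) * x + np.sqrt(1.0 - abar_t) * eps
    pred = eps_model(x_t, t)  # stands in for the denoising autoencoder eps_theta
    return float(np.mean((eps - pred) ** 2))

# Toy "model" that always predicts zero noise; the loss is then mean(eps^2).
loss = dm_loss(np.ones((4, 4)), lambda x_t, t: np.zeros_like(x_t))
```

In a real training loop this single sample would be averaged over a minibatch and backpropagated through `eps_model`.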
Generative Modeling of Latent Representations With our trained perceptual compression models, consisting of the encoder $\mathcal{E}$ and decoder $\mathcal{D}$, we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is better suited for likelihood-based generative models, since they can (i) focus on the important, semantic bits of the data and (ii) train in a lower-dimensional, computationally far more efficient space.
Unlike previous work that relied on autoregressive, attention-based transformer models in a highly compressed, discrete latent space, we can take advantage of the image-specific inductive biases our model offers. This includes the ability to build the underlying UNet primarily from 2D convolutional layers, and to further focus the objective on the perceptually most relevant bits using the reweighted bound, which now reads

$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\, \lVert \epsilon - \epsilon_\theta(z_t, t) \rVert_2^2 \,\right].$$
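In code, the latent objective differs from the pixel-space one only in that $x$ is first mapped into latent space by the frozen pretrained encoder $\mathcal{E}$, and the noise-prediction loss is then computed on $z_t$. The downsampling "encoder", schedule, and names below are illustrative stand-ins:

```python
import numpy as np

def ldm_loss(x, encoder, eps_model, T=1000, rng=None):
    # Same Monte Carlo noise-prediction loss as L_DM, but on latents
    # z = E(x); the pretrained encoder is frozen during DM training.
    rng = rng if rng is not None else np.random.default_rng(0)
    z = encoder(x)                      # z = E(x)
    t = int(rng.integers(1, T + 1))
    eps = rng.standard_normal(z.shape)
    betas = np.linspace(1e-4, 2e-2, T)  # assumed linear schedule
    abar_t = np.prod(1.0 - betas[:t])
    z_t = np.sqrt(abar_t) * z + np.sqrt(1.0 - abar_t) * eps  # forward process on z
    return float(np.mean((eps - eps_model(z_t, t)) ** 2))

# Toy stand-ins: a 2x spatial-downsampling "encoder" and a zero-noise "model".
encode = lambda x: x[::2, ::2]
loss = ldm_loss(np.ones((8, 8)), encode, lambda z_t, t: np.zeros_like(z_t))
```

The design point this illustrates: the diffusion model itself never touches pixel space during training, so all per-step costs scale with the (much smaller) latent resolution.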
The neural backbone $\epsilon_\theta(\circ, t)$ of our model is realized as a time-conditional UNet. Since the forward process is fixed, $z_t$ can be obtained efficiently from $\mathcal{E}$ during training, and samples from $p(z)$ can be decoded to image space with a single pass through $\mathcal{D}$.
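A time-conditional UNet needs the step index $t$ as an input; a standard choice is a sinusoidal timestep embedding fed to its residual blocks. A minimal sketch (the embedding dimension and frequency base are assumed here, not taken from the paper):

```python
import numpy as np

def timestep_embedding(t, dim=8):
    # Sinusoidal embedding of the diffusion step t: half the channels are
    # sines, half cosines, over geometrically spaced frequencies.
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = timestep_embedding(10, dim=8)  # one vector per step index
```

Because the embedding is smooth in $t$, nearby noise levels receive similar conditioning vectors, which is what lets a single network serve all $T$ denoising steps.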