NeurIPS 2023 Workshop on Diffusion Models
Bahjat Kawar · Valentin De Bortoli · Charlotte Bunne · James Thornton · Jiaming Song · Jong Chul Ye · Chenlin Meng
Hall B1 (level 1)
Over the past three years, diffusion models have established themselves as a new generative modeling paradigm. Their empirical successes have broadened the applications of generative modeling to image, video, audio, and 3D synthesis, scientific applications, and more. As diffusion models become increasingly popular and are applied to extremely diverse problems, it also becomes harder to follow the key contributions in the field. This workshop aims to keep track of recent advances and identify guidelines for future research. By bringing together practitioners, methodologists, and theorists, we aim to identify unexplored areas, foster collaboration, and push the frontier of diffusion model research.
Link to website: https://diffusionworkshop.github.io/
Ask questions to our panelists here: https://docs.google.com/forms/d/e/1FAIpQLSeTRsWFvKlsFg31K8Vq6hHGOydmvd7YNMuOLOCcKgqSqO8mXw/viewform
Schedule
Fri 6:50 a.m. - 7:00 a.m. | Opening Remarks
Fri 7:00 a.m. - 7:30 a.m. | Tali Dekel: Pretrained diffusion is all we need: a journey beyond training distribution (Keynote)
Fri 7:30 a.m. - 8:00 a.m. | Brian Trippe & Jason Yim: De novo design of protein structure and function with RFdiffusion (Keynote)
Fri 8:00 a.m. - 8:15 a.m. | Coffee Break
Fri 8:15 a.m. - 9:15 a.m. | Poster Session 1
Fri 9:15 a.m. - 10:00 a.m. | Panel Discussion: Arash Vahdat, Ruiqi Gao, Tim Salimans, Robin Rombach
Fri 10:00 a.m. - 11:30 a.m. | Lunch
Fri 11:30 a.m. - 12:00 p.m. | Sayak Paul: Controlling Text-to-Image Diffusion Models (Keynote)
Fri 12:00 p.m. - 12:30 p.m. | Yang Song: Consistency Models (Keynote)
Fri 12:30 p.m. - 12:40 p.m. | LC-SD: Realistic Endoscopic Image Generation with Stable Diffusion and ControlNet (Oral)
Computer-assisted surgical systems provide support information to the surgeon, which can improve the execution and overall outcome of the procedure. These systems are based on deep learning models that are trained on complex and challenging-to-annotate data. Generating synthetic data can overcome these limitations, but it is necessary to reduce the domain gap between real and synthetic data. We propose a method for image-to-image translation based on a Stable Diffusion model, which generates realistic images starting from synthetic data. Compared to previous works, the proposed method is better suited for clinical application, as it requires a much smaller amount of input data and allows finer control over the generation of details by introducing different variants of supporting control networks. The proposed method is applied in the context of laparoscopic cholecystectomy, using synthetic and real data from public datasets. It achieves a mean Intersection over Union of 69.76%, a significant improvement over the 42.21% baseline. The proposed method for translating synthetic images into images with realistic characteristics will enable the training of deep learning methods that can generalize optimally to real-world contexts, thereby improving computer-assisted intervention guidance systems.
Joanna Kaleta · Diego Dall'Alba · Szymon Plotka · Przemyslaw Korzeniowski
Fri 12:40 p.m. - 12:50 p.m. | Manifold Diffusion Fields (Oral)
We present Manifold Diffusion Fields (MDF), an approach that unlocks learning of diffusion models of data in general non-Euclidean geometries. Leveraging insights from spectral geometry analysis, we define an intrinsic coordinate system on the manifold via the eigenfunctions of the Laplace-Beltrami operator. MDF represents functions using an explicit parametrization formed by a set of multiple input-output pairs. Empirical results on multiple datasets and manifolds, including challenging scientific problems like weather prediction and molecular conformation, show that MDF can capture distributions of such functions with better diversity and fidelity than previous approaches.
Ahmed Elhag · Yuyang Wang · Joshua Susskind · Miguel Angel Bautista
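A sketch of the coordinate system the abstract refers to, in our own notation rather than the authors': a point $p$ on the manifold $\mathcal{M}$ is embedded through the first $k$ eigenfunctions of the Laplace-Beltrami operator, which play the role that Fourier features play in Euclidean space,

$$\Delta_{\mathcal{M}}\, \varphi_i = -\lambda_i\, \varphi_i, \qquad \phi(p) = \big(\varphi_1(p), \ldots, \varphi_k(p)\big) \in \mathbb{R}^k,$$

and a function on the manifold is then handled as the set of input-output pairs $\{(\phi(p_j), f(p_j))\}_j$ that the diffusion model operates on.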
Fri 12:50 p.m. - 1:00 p.m. | The Emergence of Reproducibility and Consistency in Diffusion Models (Oral)
In this work, we uncover a distinct and prevalent phenomenon within diffusion models, in contrast to most other generative models, which we refer to as "consistent model reproducibility". To elaborate, our extensive experiments have consistently shown that, when starting with the same initial noise input and sampling with a deterministic solver, diffusion models tend to produce nearly identical output content. This consistency holds true regardless of the choice of model architecture and training procedure. Additionally, our research has unveiled that this exceptional model reproducibility manifests in two distinct training regimes: (i) a "memorization regime," characterized by a significantly overparameterized model which attains reproducibility mainly by memorizing the training data; and (ii) a "generalization regime," in which the model is trained on an extensive dataset and its reproducibility emerges with the model's generalization capabilities.
Huijie Zhang · Jinfan Zhou · Yifu Lu · Minzhe Guo · Liyue Shen · Qing Qu
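One way to state the observed phenomenon formally (our paraphrase of the abstract): let $\Psi_\theta$ denote a deterministic sampling map (e.g., DDIM or a probability-flow ODE solver) for a model with weights $\theta$. For models $\theta_1$ and $\theta_2$ trained independently, possibly with different architectures and training procedures,

$$\Psi_{\theta_1}(x_T) \approx \Psi_{\theta_2}(x_T) \quad \text{for the same initial noise } x_T \sim \mathcal{N}(0, I),$$

with the two regimes differing only in why the agreement arises (memorization versus shared generalization).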
Fri 1:00 p.m. - 1:15 p.m. | Coffee Break
Fri 1:15 p.m. - 2:15 p.m. | Invited Lightning Talks
Fri 2:15 p.m. - 3:15 p.m. | Poster Session 2
Fri 3:15 p.m. - 3:30 p.m. | Closing Remarks + Award Reveal
Posters
TS-DiffuGen: An equivariant diffusion model for reaction transition state conformation generation (Poster)
Molecular geometry optimization, particularly in the context of transition state generation, poses significant computational challenges that hinder its use within large-scale reaction workflows. Traditional methods rely on resource-intensive quantum mechanical approaches like density functional theory, demanding both computational resources and substantial prior reaction knowledge. Recent advancements in deep-learning-based diffusion models have shown promise in predicting reaction transition state conformations, but current models rely on extensive architectures that capture a reaction's geometry, as well as on model ensembles. This work proposes an equivariant diffusion model designed to address these computational expenses and complex architectures. Our model demonstrates robust generalizability and efficiency in predicting transition state conformations, making it a valuable tool for a broader range of chemical reactions. Our approach is a step towards eliminating the computational barriers associated with classic transition state generation techniques, providing chemists with a powerful tool to rapidly propose transition state structures. Code and data can be found at https://figshare.com/s/cb10fda0c88f18d00baf.
Sacha Raffaud · Jeff Guo · Philippe Schwaller
EDGE++: Improved Training and Sampling of EDGE (Poster)
Traditional graph-generative models like the Stochastic Block Model (SBM) fall short in capturing complex structures inherent in large graphs. Recently developed deep learning models like NetGAN, CELL, and Variational Graph Autoencoders have made progress but face limitations in replicating key graph statistics. Diffusion-based methods such as EDGE have emerged as promising alternatives; however, they present challenges in computational efficiency and generative performance. In this paper, we propose enhancements to the EDGE model to address these issues. Specifically, we introduce a degree-specific noise schedule that optimizes the number of active nodes at each timestep, significantly reducing memory consumption. Additionally, we present an improved sampling scheme that fine-tunes the generative process, allowing for better control over the similarity between the synthesized and the true network. Our experimental results demonstrate that the proposed modifications not only improve the efficiency but also enhance the accuracy of the generated graphs, offering a robust and scalable solution for graph generation tasks.
Xiaohui Chen · Mingyang Wu · Liping Liu
The Hidden Linear Structure in Score-Based Models and its Application (Poster)
Score-based models have achieved remarkable results in the generative modeling of multiple domains. By learning the gradient of the smoothed data distribution, they can iteratively generate samples from complex distributions, e.g., natural images. However, is there any universal structure in the gradient field that will eventually be learned by any neural network? Here, we aim to find such structures through a normative analysis of the score function. First, we derive the closed-form solution to the score-based model with a Gaussian score. We claim that for well-trained diffusion models, the learned score at a high noise scale is well approximated by the linear score of a Gaussian. We demonstrate this through empirical validation of pre-trained image diffusion models and theoretical analysis of the score function. This finding enables us to precisely predict the initial diffusion trajectory using the analytical solution and to accelerate image sampling by 15-30% by skipping the initial phase without sacrificing image quality. Our finding of the linear structure in score-based models has implications for how to design and train diffusion models better.
Binxu Wang · John Vastola
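For reference, the linear score in question is standard Gaussian algebra rather than notation from the paper: if the data distribution is approximated by $\mathcal{N}(\mu, \Sigma)$, then the distribution smoothed with noise of scale $\sigma$ is $\mathcal{N}(\mu, \Sigma + \sigma^2 I)$, whose score is exactly linear in $x$,

$$\nabla_x \log p_\sigma(x) = -\left(\Sigma + \sigma^2 I\right)^{-1}(x - \mu) \approx -\frac{x - \mu}{\sigma^2} \quad \text{for large } \sigma,$$

so at high noise scales a well-trained score network has little choice but to approach this closed form, which is what makes the initial part of the sampling trajectory analytically predictable.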
Controllable Music Production with Diffusion Models and Guidance Gradients (Poster)
We demonstrate how conditional generation from diffusion models can be used to tackle a variety of realistic tasks in the production of music in 44.1kHz stereo audio with sampling-time guidance. The scenarios we consider include continuation, inpainting and regeneration of musical audio, the creation of smooth transitions between two different music tracks, and the transfer of desired stylistic characteristics to existing audio clips. We achieve this by applying guidance at sampling time in a simple framework that supports both reconstruction and classification losses, or any combination of the two. This approach ensures that generated audio can match its surrounding context, or conform to a class distribution or latent representation specified relative to any suitable pre-trained classifier or embedding model.
Mark Levy · Bruno Di Giorgi · Floris Weers · Angelos Katharopoulos · Tom R Nickson
Effective Quantization for Diffusion Models on CPUs (Poster)
Diffusion models have gained popularity for generating images from textual descriptions. Nonetheless, their substantial need for computational resources continues to present a noteworthy challenge, contributing to time-consuming processes. Quantization, a technique employed to compress deep learning models for enhanced efficiency, presents challenges when applied to diffusion models, which are notably more sensitive to quantization than other model types, potentially resulting in degraded image quality. In this paper, we introduce a novel approach to quantizing diffusion models that leverages both quantization-aware training and distillation. Our results show that the quantized models can maintain high image quality while demonstrating inference efficiency on CPUs.
Hanwen Chang · Haihao Shen · Yiyang Cai · Xinyu Ye · Zhenzhong Xu · Wenhua Cheng · Weiwei Zhang · Yintong Lu · Heng Guo
Staged Diffusion Models with Analytically Designed Hyperparameters (Poster)
We present StaDM, a framework for staged diffusion modeling in which the generation of high-dimensional data is partitioned into multiple stages, each performing diffusion on a projected subspace of the dimensions. We derive an analytical objective in terms of an ideal denoiser which allows the staging hyperparameters to be optimized without expensive retraining and generation steps. For the reverse process, we design a back-projection strategy for switching between stages, thereby eliminating the need for training special bridging networks. We illustrate the usefulness of staged diffusion with (1) semi-autoregressive staging, where each stage denoises a disjoint subset of dimensions chosen analytically, and (2) multi-resolution staging with analytically chosen switch points instead of existing fixed switch points. On image generation tasks we achieve up to a 35% reduction in sample generation time over homogeneous image diffusion models. The staging hyperparameters obtained using our method are significantly faster to obtain than with empirical generate-and-test approaches.
Kanad Pardeshi · Shrey Singla · Sunita Sarawagi
VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing (Poster)
Recently, diffusion-based generative models have achieved remarkable success for image generation and editing. However, their use for video editing still faces important limitations. This paper introduces VidEdit, a novel method for zero-shot text-based video editing that ensures strong temporal and spatial consistency. First, we propose to combine atlas-based and pre-trained text-to-image diffusion models to provide a training-free and efficient editing method, which by design ensures temporal smoothness. Second, we leverage off-the-shelf panoptic segmenters along with edge detectors and adapt their use for conditioned diffusion-based atlas editing. This ensures fine spatial control over targeted regions while strictly preserving the structure of the original video. Quantitative and qualitative experiments show that VidEdit outperforms state-of-the-art methods on the DAVIS dataset with regard to semantic faithfulness, image preservation, and temporal consistency metrics. With this framework, processing a single video takes only approximately one minute, and multiple compatible edits can be generated from a unique text prompt.
Paul Couairon · Clément Rambour · Jean-Emmanuel HAUGEARD · Nicolas THOME
Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion (Poster)
Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion model sampling at the cost of sample quality but lack a natural way to trade off quality for speed. To address this limitation, we propose the Consistency Trajectory Model (CTM), a generalization encompassing CM and score-based models as special cases. CTM trains a single neural network that can output scores (i.e., gradients of log-density) and enables unrestricted traversal between any initial and final time along the Probability Flow Ordinary Differential Equation (ODE) in a diffusion process. CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance, and achieves new state-of-the-art FIDs for single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at 64x64 resolution (FID 1.98). CTM also enables a new family of sampling schemes, both deterministic and stochastic, involving long jumps along the ODE solution trajectories.
Dongjun Kim · Chieh-Hsin Lai · Wei-Hsiang Liao · Naoki Murata · Yuhta Takida · Toshimitsu Uesaka · Yutong He · Yuki Mitsufuji · Stefano Ermon
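A schematic of the object CTM learns, in our summary notation: a single network $G_\theta(x_t, t, s)$ approximating the probability-flow ODE solution from any time $t$ to any earlier time $s$,

$$G_\theta(x_t, t, s) \approx x_t + \int_t^s \frac{\mathrm{d}x_u}{\mathrm{d}u}\, \mathrm{d}u, \qquad G_\theta(x_t, t, t) = x_t,$$

so that consistency models are recovered as the special case of always jumping to $s = 0$, while score-based sampling corresponds to taking many small steps $s \to t$.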
Learning from Invalid Data: On Constraint Satisfaction in Generative Models (Poster)
Generative models have demonstrated impressive results in vision, language, and speech. However, even with massive datasets, they struggle with precision, generating physically invalid or factually incorrect data. To improve precision while preserving diversity and fidelity, we propose a novel training mechanism that leverages datasets of constraint-violating data points, which we consider invalid. Our approach minimizes the divergence between the generative distribution and the valid prior while maximizing the divergence with the invalid distribution. We demonstrate how generative models like diffusion models and GANs, when augmented to train with invalid data, improve on their standard counterparts trained solely on valid data points. We also explore connections between density ratios and guidance in diffusion models. Our proposed mechanism offers a promising solution for improving precision in generative models while preserving diversity and fidelity, particularly in domains where constraint satisfaction is critical and data is limited, such as engineering design, robotics, and medicine.
Giorgio Giannone · Lyle Regenwetter · Akash Srivastava · Dan Gutfreund · Faez Ahmed
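One way to write the training criterion described above (our formulation; the paper's exact divergences and weighting may differ): with a valid distribution $p_{\text{valid}}$, an invalid distribution $p_{\text{invalid}}$, and generator distribution $p_\theta$,

$$\min_\theta \; D\big(p_\theta \,\|\, p_{\text{valid}}\big) - \lambda\, D\big(p_\theta \,\|\, p_{\text{invalid}}\big), \qquad \lambda > 0,$$

which pulls the model toward valid samples while pushing it away from the constraint-violating set.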
Diffusion-C: Unveiling the Generative Challenges of Diffusion Models through Corrupted Data (Poster)
In our contemporary academic inquiry, we present "Diffusion-C," a foundational methodology to analyze the generative restrictions of Diffusion Models, particularly those akin to GANs, DDPM, and DDIM. By employing input visual data that has been subjected to a myriad of corruption modalities and intensities, we elucidate the performance characteristics of those Diffusion Models. The noise component takes center stage in our analysis, hypothesized to be a pivotal element influencing the mechanics of deep learning systems. In our rigorous expedition utilizing Diffusion-C, we have discerned the following critical observations: (I) within the milieu of generative models under the Diffusion taxonomy, DDPM emerges as a paragon, consistently exhibiting superior performance metrics; (II) within the vast spectrum of corruption frameworks, the fog and fractal corruptions notably undermine the functional robustness of both DDPM and DDIM; (III) the vulnerability of Diffusion Models to these particular corruptions is significantly influenced by topological and statistical similarities, particularly concerning the alignment between mean and variance. This scholarly work highlights Diffusion-C's core understandings regarding the impacts of various corruptions, setting the stage for future research endeavors in the realm of generative models.
Keywoong Bae · Suan Lee · Wookey Lee
Beyond Generation: Exploring Generalization of Diffusion Models in Few-shot Segmentation (Poster)
Diffusion models have demonstrated a superior capability for generating high-quality images. While their proficiency in image generation is evident, the generalization of diffusion models to few-shot segmentation remains scarcely explored. In this paper, we delve into the generalization of a pretrained diffusion model, specifically Stable Diffusion, within the feature space for few-shot segmentation. First, we propose a straightforward strategy to extract intermediate knowledge from diffusion models as image features, applying them to the few-shot segmentation of real images. Second, we introduce a training-free method that employs pretrained diffusion features for few-shot segmentation. Through extensive experiments on two benchmarks, the proposed method utilizing diffusion features outperforms weakly-supervised few-shot segmentation methods and the DINO-V2 baseline. Without any training on base classes, it also attains performance comparable to supervised methods.
Jie Liu · TAO HU · Jan-jakob Sonke · Efstratios Gavves
Improving Discrete Diffusion Models via Structured Preferential Generation (Poster)
In the domains of image and audio, diffusion models have shown impressive performance. However, their application to discrete data types, such as language, has often been suboptimal compared to autoregressive generative models. This paper tackles the challenge of improving discrete diffusion models by introducing a structured forward process that leverages the inherent information hierarchy in discrete categories, such as words in text. Our approach biases the generative process to produce certain categories before others, resulting in a notable improvement in log-likelihood scores. This work paves the way for more advanced discrete diffusion models with potentially significant enhancements in performance.
Severi Rissanen · Markus Heinonen · Arno Solin
Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization (Poster)
Diffusion models, rivaling GANs for high-quality sample generation, leverage score matching to learn the score function. Despite empirical success, the provable accuracy of gradient-based algorithms for this task remains unclear. As a first step toward answering this question, this paper establishes a mathematical framework for analyzing score estimation using neural networks trained by gradient descent. The analysis covers both the optimization and the generalization aspects of the training procedure. We propose a parametric form to formulate the denoising score-matching problem as a regression with noisy labels. Compared to standard supervised learning, the score-matching problem introduces distinct challenges, including unbounded inputs, vector-valued outputs, and an additional time variable, preventing existing techniques from being applied directly. With a properly designed neural network architecture, we demonstrate an accurate approximation of the score function using a reproducing kernel Hilbert space induced by neural tangent kernels. We establish the first generalization error bound for learning the score function by applying early stopping and coupling arguments.
Yinbin Han · Meisam Razaviyayn · Renyuan Xu
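For context, the regression problem referred to here is standard denoising score matching, written in a common convention (the paper's parametric form may differ): with $x_t = \alpha_t x_0 + \sigma_t \varepsilon$ and $\varepsilon \sim \mathcal{N}(0, I)$, one fits

$$\min_\theta \; \mathbb{E}_{t,\, x_0,\, \varepsilon} \left\| s_\theta(x_t, t) + \frac{\varepsilon}{\sigma_t} \right\|^2,$$

a regression whose label $-\varepsilon/\sigma_t$ is a noisy, unbiased stand-in for the true score $\nabla_{x_t} \log p_t(x_t)$; this is what makes the noisy-label viewpoint natural.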
Decoding the Seeing Brain: Reconstructing Images and Text from fMRI Recordings with multimodal diffusion models (Poster)
The human brain adeptly processes immense visual information using complex neural mechanisms. Recent advances in functional MRI (fMRI) enable decoding this visual information from recorded brain activity patterns. In this work, we present an innovative approach for reconstructing meaningful images and captions directly from fMRI data, with a focus on brain captioning due to its enhanced flexibility over image decoding. We utilize the Natural Scenes fMRI dataset containing brain recordings from subjects viewing images. Our method leverages state-of-the-art image captioning and diffusion models for multimodal decoding. We train regression models between fMRI data and textual/visual features and incorporate depth estimation to guide image reconstruction. Our key innovation is a multimodal framework aligning neural and deep learning representations to generate both semantic captions and photorealistic images from brain activity. We demonstrate quantitative improvements in captioning over prior art and in image spatial relationships through our reconstruction pipeline. In conclusion, this work significantly advances brain decoding capabilities through an integrated vision-language approach. Our flexible decoding platform, combining high-level semantic text and low-level visual depth information, provides new insights into human visual cognition. The proposed methods could enable future applications in brain-computer interfaces, neuroscience, and AI.
Matteo Ferrante · Tommaso Boccato · Furkan Ozcelik · Rufin VanRullen · Nicola Toschi
Fast Sampling via De-randomization for Discrete Diffusion Models (Poster)
Diffusion models have emerged as powerful tools for high-quality data generation, such as image generation. Despite their success in continuous spaces, discrete diffusion models, which apply to domains such as texts and natural languages, remain under-studied and often suffer from slow generation speed. In this paper, we propose a novel de-randomized diffusion process, which leads to an accelerated algorithm for discrete diffusion models. Our technique significantly reduces the number of function evaluations (i.e., calls to the score network), making the sampling process much faster. Furthermore, we introduce a continuous-time (i.e., infinite-step) sampling algorithm that can provide even better sample quality than its discrete-time (finite-step) counterpart. Extensive experiments on natural language generation and machine translation tasks demonstrate the superior performance of our method in terms of both generation speed and sample quality over existing methods for discrete diffusion models.
Zixiang Chen · Angela Yuan · Yongqian Li · Yiwen Kou · Junkai Zhang · Quanquan Gu
Diffusion models for probabilistic programming (Poster)
We propose diffusion model variational inference (DMVI), a novel method for automated approximate inference in probabilistic programming languages (PPLs). DMVI utilizes diffusion models as variational approximations to the true posterior distribution by deriving a novel bound on the marginal likelihood objective used in Bayesian modelling. DMVI is easy to implement, allows hassle-free inference in PPLs without the drawbacks of, e.g., variational inference using normalizing flows, and does not impose any constraints on the underlying neural network model. We evaluate DMVI on a set of common Bayesian models and show that its posterior inferences are in general more accurate than those of contemporary methods used in PPLs, while having a similar computational cost and requiring less manual tuning.
Simon Dirmeier · Fernando Perez-Cruz
A Denoising Diffusion Model for Synthetic Fluid Field Prediction (Poster)
We present a novel denoising-diffusion-based generative model framework for predicting synthetic nonlinear fluid fields. The model utilizes forward and inverse diffusion processes to learn complex representations of high-dimensional dynamic systems and predicts spatial-temporal evolution trajectories by sampling from learned posteriors. Additionally, we introduce a modified physics-informed loss that provides a physically meaningful regularization in the training pipeline. We demonstrate the model's predictive capacity in different experiments using numerical simulations, examining it both qualitatively and quantitatively. Results show that the model predicts fluid features associated with spatial-temporal coordinates without numerically solving the fluid-governing partial differential equations. Overall, this work demonstrates the potential of denoising diffusion generative models as a promising direction for further investigation into developing new computational fluid dynamics tools and broader applications.
Gefan Yang · Stefan Sommer
AlphaFold Meets Flow Matching for Generating Protein Ensembles (Poster)
The significant success of AlphaFold2 at protein structure prediction has pointed to structural ensembles as the next frontier towards a more complete computational understanding of protein structure. At the same time, iterative refinement-based techniques such as diffusion have driven significant breakthroughs in generative modeling. We explore the synergy of these developments by combining highly accurate protein structure prediction models with flow matching, a powerful modern generative modeling framework, in order to sample the conformational landscape of proteins. Preliminary results on membrane transporters, ligand-induced conformational change, and disordered ensembles show the potential of the approach. Importantly, and unlike MSA-based methods, our method also obtains similar distributions even when used with language model-based algorithms such as ESMFold, which are otherwise deterministic given an input sequence. These results open exciting avenues in the computational prediction of conformational flexibility.
Bowen Jing · Bonnie Berger · Tommi Jaakkola
SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions (Poster)
The remarkable capabilities of pretrained image diffusion models have been utilized not only for generating fixed-size images but also for creating panoramas. However, naive stitching of multiple images often results in visible seams. Recent techniques have attempted to address this issue by performing joint diffusions in multiple windows and averaging latent features in overlapping regions. However, these approaches, which focus on seamless montage generation, often yield incoherent outputs by blending different scenes within a single image. To overcome this limitation, we propose SyncDiffusion, a plug-and-play module that synchronizes multiple diffusions through gradient descent from a perceptual similarity loss. Specifically, we compute the gradient of the perceptual loss using the predicted denoised images at each denoising step, providing meaningful guidance for achieving coherent montages. We demonstrate the versatility of SyncDiffusion by applying it to three applications: text-guided panorama generation, conditional panorama generation, and 360-degree panorama generation. Moreover, our experimental results suggest that our method produces significantly more coherent outputs than previous methods (66.35% vs. 33.65% in our user study) while still maintaining fidelity (as assessed by GIQA) and compatibility with the input prompt (as measured by CLIP score).
Yuseung Lee · Kunho Kim · Hyunjin Kim · Minhyuk Sung
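A sketch of the synchronization step as we read it (the perceptual loss $\mathcal{L}_{\text{percept}}$, e.g., LPIPS, and the weight $w$ are our generic placeholders): at each denoising step, the latent $z_t^{(i)}$ of window $i$ is nudged toward an anchor window $0$ using the gradient of a perceptual loss computed on the one-step denoised predictions $\hat{x}_0$,

$$z_t^{(i)} \leftarrow z_t^{(i)} - w\, \nabla_{z_t^{(i)}} \mathcal{L}_{\text{percept}}\!\big(\hat{x}_0(z_t^{(i)}),\, \hat{x}_0(z_t^{(0)})\big),$$

before the usual averaging of overlapping latents, which is what keeps separate windows depicting one coherent scene.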
Retrofitting 2D Latent Diffusion for Controllable Face Image Generation (Poster)
In this paper, we propose RetroLaDi, a 3D-aware face image generation method built by retrofitting a 2D latent diffusion model. The core procedure is to integrate and reconcile external 3D priors from 3DMMs with the internal knowledge in a pretrained 2D diffusion model. Experimental results demonstrate its ability to generate high-fidelity face images with more precise controllability than state-of-the-art 2D-based and 3D-based controllable face synthesis methods.
Weihao Xia · Cengiz Oztireli · Jing-Hao Xue
IterInv: Iterative Inversion for Pixel-Level T2I Models (Poster)
Large-scale text-to-image diffusion models have been a ground-breaking development in generating convincing images following an input text prompt. The goal of image editing research is to give users control over the generated images by modifying the text prompt. Current image editing techniques rely on DDIM inversion as a common practice, based on Latent Diffusion Models (LDM). However, large pretrained T2I models working in the latent space, such as LDM, suffer from losing details due to the first compression stage with an autoencoder mechanism. Another mainstream T2I pipeline working at the pixel level, such as Imagen and DeepFloyd-IF, avoids this problem. These pipelines are commonly composed of several stages, normally a text-to-image stage followed by several super-resolution stages. In this case, DDIM inversion is unable to find the initial noise that generates the original image, given that the super-resolution diffusion models are not compatible with the DDIM technique. According to our experimental findings, iteratively concatenating the noisy image as the condition is the root of this problem. Based on this observation, we develop an iterative inversion (IterInv) technique for this stream of T2I models and verify IterInv with the open-source DeepFloyd-IF model. By combining IterInv with a popular image editing method, we demonstrate its application prospects. The code will be released upon acceptance.
Chuanming Tang · kai wang · Joost van de Weijer
Investigating the Adversarial Robustness of Density Estimation Using the Probability Flow ODE (Poster)
Beyond their impressive sampling capabilities, score-based diffusion models offer a powerful analysis tool in the form of unbiased density estimation of a query sample under the training data distribution. In this work, we investigate the robustness of density estimation using the probability flow (PF) neural ordinary differential equation (ODE) model against gradient-based likelihood maximization attacks and the relation to sample complexity, where the compressed size of a sample is used as a measure of its complexity. We introduce and evaluate six gradient-based log-likelihood maximization attacks, including a novel reverse integration attack. Our experimental evaluations on CIFAR-10 show that density estimation using the PF ODE is robust against high-complexity, high-likelihood attacks, and that in some cases adversarial samples are semantically meaningful, as expected from a robust estimator.
Marius Arvinte · Cory Cornelius · Jason Martin · Nageen Himayat
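For reference, the density estimate under attack is the standard continuous normalizing flow identity, written in generic notation: integrating the probability flow ODE $\dot{x}_t = f_\theta(x_t, t)$ from the query $x$ at $t = 0$ to $x_T$ at $t = T$,

$$\log p_0(x) = \log p_T(x_T) + \int_0^T \nabla \cdot f_\theta(x_t, t)\, \mathrm{d}t,$$

and the attacks perturb $x$ to maximize this quantity, probing whether high likelihood can be reached by semantically meaningless inputs.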
Towards Fast Stochastic Sampling in Diffusion Generative Models (Poster)
Diffusion models suffer from slow sample generation at inference time. Despite recent efforts, improving the sampling efficiency of stochastic samplers for diffusion models remains a promising direction. We propose Splitting Integrators for fast stochastic sampling in pre-trained diffusion models in augmented spaces. Commonly used in molecular dynamics, splitting-based integrators attempt to improve sampling efficiency by cleverly alternating between numerical updates involving the data, auxiliary, or noise variables. However, we show that a naive application of splitting integrators is sub-optimal for fast sampling. Consequently, we propose several modifications to improve sampling efficiency and denote the resulting samplers as Reduced Splitting Integrators. In the context of Phase Space Langevin Diffusion [Pandey & Mandt, 2023] on CIFAR-10, our stochastic sampler achieves an FID score of 2.36 in only 100 network function evaluations (NFE) as compared to 2.63 for the best baselines.
Kushagra Pandey · Maja Rudolph · Stephan Mandt
Motion Flow Matching for Efficient Human Motion Synthesis and Editing (Poster)
Human motion synthesis is a fundamental task in the field of computer animation. Recent methods based on diffusion models or GPT-style architectures demonstrate commendable performance but exhibit drawbacks in terms of slow sampling speeds or the accumulation of errors. In this paper, we propose Motion Flow Matching, a novel generative model designed for human motion generation featuring efficient sampling and effectiveness in motion editing applications. Our method reduces the sampling complexity from 1000 steps in previous diffusion models to just 10 steps, while achieving comparable performance on text-to-motion and action-to-motion generation benchmarks. Notably, our approach establishes a new state-of-the-art Fréchet Inception Distance on the KIT-ML dataset. Moreover, we tailor a straightforward motion editing paradigm named trajectory rewriting, leveraging ODE-style generative models, and apply it to various editing scenarios including motion prediction, motion in-between prediction, motion interpolation, and upper-body editing.
TAO HU · Wenzhe Yin · Pingchuan Ma · Yunlu Chen · Basura Fernando · Yuki M Asano · Efstratios Gavves · Pascal Mettes · Björn Ommer · Cees Snoek
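For context, the flow-matching recipe behind the 10-step sampling, in its common rectified-flow form (conditioning on text or action labels is omitted here): with data $x_1$, noise $x_0$, and interpolant $x_t = (1-t)\,x_0 + t\,x_1$, a velocity field is fit by

$$\min_\theta \; \mathbb{E}_{t,\, x_0,\, x_1} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2,$$

and sampling integrates $\dot{x}_t = v_\theta(x_t, t)$ from $t = 0$ to $t = 1$; the near-straight learned paths are what allow so few integration steps.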
In search of dispersed memories: Generative diffusion models are associative memory networks (Poster)
Hopfield networks are widely used in neuroscience as simplified theoretical models of biological associative memory. The original Hopfield networks store memories by encoding patterns of binary associations, which results in a synaptic learning mechanism known as the Hebbian learning rule. Modern Hopfield networks can achieve exponential capacity scaling by using highly non-linear energy functions. However, the energy function of these newer models cannot be straightforwardly compressed into binary synaptic couplings, and it does not directly provide new synaptic learning rules. In this work we show that generative diffusion models can be interpreted as energy-based models and that, when trained on discrete patterns, their energy function is equivalent to that of modern Hopfield networks. This equivalence allows us to interpret the supervised training of diffusion models as a synaptic learning process that encodes the associative dynamics of a modern Hopfield network in the weight structure of a deep neural network. Accordingly, in our experiments we show that the storage capacity of a continuous modern Hopfield network is identical to the capacity of a diffusion model. Our results establish a strong link between generative modeling and the theoretical neuroscience of memory, providing a powerful computational foundation for the reconstructive theory of memory, in which creative generation and memory recall can be seen as parts of a unified continuum.
Luca Ambrogioni
Improved Convergence of Score-Based Diffusion Models via Prediction-Correction (Poster)
Score-based generative models (SGMs) are powerful tools to sample from complex data distributions. The idea is to run an ergodic stochastic process for time $T_1$ and then learn to revert this process. As the approximate reverse process is initialized with the stationary distribution of the forward one, the existing analysis paradigm requires $T_1 \to \infty$. This is, however, problematic, as it leads to error propagation, unstable convergence results, and increased computational cost. We address the issue by considering a version of the popular predictor-corrector scheme: after running the forward process, we first estimate the final distribution via an inexact Langevin dynamics and then revert the process. Our main results provide convergence guarantees for this scheme, which has the key advantage that the forward process is required to run only for a fixed finite time $T_1$. Our bounds exhibit a mild logarithmic dependence on the input dimension and the subgaussian norm of the target distribution, have minimal assumptions on the data, and require only to control the $L^2$ loss of the score approximation.
Francesco Pedrotti · Jan Maas · Marco Mondelli
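A schematic of the scheme, in our paraphrase (step sizes and the inexactness model are as in the paper): run the forward process only up to a fixed finite time $T_1$, estimate $p_{T_1}$ with a few inexact Langevin steps

$$x \leftarrow x + \gamma\, \hat{s}(x) + \sqrt{2\gamma}\, \xi, \qquad \xi \sim \mathcal{N}(0, I), \quad \hat{s} \approx \nabla \log p_{T_1},$$

and then run the learned reverse process from the corrected samples, which removes the need for $T_1 \to \infty$.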
You Only Submit One Image to Find the Most Suitable Generative Model (Poster)
Deep generative models have achieved promising results in image generation, and various generative model hubs, e.g., Hugging Face and Civitai, have been developed that enable model developers to upload models and users to download models. However, these model hubs lack advanced model management and identification mechanisms, leaving users to search for models through text matching, download sorting, etc., which makes it difficult to efficiently find the model that best meets their requirements. In this paper, we propose a novel setting called Generative Model Identification (GMI), which aims to enable users to efficiently identify the most appropriate generative model(s) for their requirements from a large number of candidate models. To the best of our knowledge, it has not been studied yet. We introduce a comprehensive solution consisting of three pivotal modules: a weighted Reduced Kernel Mean Embedding (RKME) framework for capturing the generated image distribution and the relationship between images and prompts, a pre-trained vision-language model aimed at addressing dimensionality challenges, and an image interrogator designed to tackle cross-modality issues. Extensive empirical results demonstrate that the proposal is both efficient and effective. For example, users need only submit a single example image to describe their requirements, and the model platform can achieve an average top-4 identification accuracy of more than 80%.
Zhi Zhou · Lan-Zhe Guo · Pengxiao Song · Yu-Feng Li
Denoising Heat-inspired Diffusion with Insulators for Collision Free Motion Planning (Poster)
Diffusion models have risen as a powerful tool in robotics due to their flexibility and multi-modality. While some of these methods effectively address complex problems, they often depend heavily on inference-time obstacle detection and require additional equipment. Addressing these challenges, we present a method that, during inference time, simultaneously generates only reachable goals and plans motions that avoid obstacles, all from a single visual input. Central to our approach is the novel use of a collision-avoiding diffusion kernel for training. Through evaluations against behavior-cloning and classical diffusion models, our framework has proven its robustness. It is particularly effective in multi-modal environments, navigating toward goals and avoiding unreachable ones blocked by obstacles, while ensuring collision avoidance.
Junwoo Chang · Hyunwoo Ryu · Jiwoo Kim · Soochul Yoo · Joohwan Seo · Nikhil Potu Surya Prakash · Jongeun Choi · Roberto Horowitz
DiffEnc: Variational Diffusion with a Learned Encoder (Poster)
Diffusion models may be viewed as hierarchical variational autoencoders (VAEs) with two improvements: parameter sharing for the conditionals in the generative process and efficient computation of the loss as independent terms over the hierarchy. We consider two modifications that retain these advantages while increasing model flexibility. First, we introduce an encoder that defines a data- and depth-dependent mean function in the diffusion process, which leads to a modified diffusion loss. Our proposed framework, DiffEnc, achieves state-of-the-art likelihood on CIFAR-10. Second, we let the ratio of the noise variance of the reverse encoder process and the generative process be a free weight parameter rather than fixing it to one. This leads to theoretical insights: for a finite-depth hierarchy, the evidence lower bound (ELBO) can be used as an objective for a weighted diffusion loss approach and for optimizing the noise schedule specifically for inference. For the infinite-depth hierarchy, on the other hand, the weight parameter has to be one to have a well-defined ELBO.
Beatrix M. G. Nielsen · Anders Christensen · Andrea Dittadi · Ole Winther
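One way to write the first modification (our notation, not necessarily the paper's): where a standard variational diffusion model uses the forward marginal $q(z_t \mid x) = \mathcal{N}(\alpha_t\, x,\; \sigma_t^2 I)$, the learned encoder replaces the mean by a data- and depth-dependent function,

$$q_\phi(z_t \mid x) = \mathcal{N}\!\big(\alpha_t\, m_\phi(x, t),\; \sigma_t^2 I\big),$$

so the target that the generative process denoises toward can itself change with depth $t$, which is what modifies the diffusion loss.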
Beyond U: Making Diffusion Models Faster & Lighter (Poster)
Diffusion models are a family of generative models that yield record-breaking performance in tasks such as image synthesis, video generation, and molecule design. Despite their capabilities, their efficiency, especially in the reverse denoising process, remains a challenge due to slow convergence rates and high computational costs. In this work, we introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models that is more parameter-efficient, exhibits faster convergence, and demonstrates increased noise robustness. Experimenting with denoising diffusion probabilistic models, our framework operates with approximately a quarter of the parameters and 30% of the Floating Point Operations (FLOPs) of the standard U-Nets in Denoising Diffusion Probabilistic Models (DDPMs). Furthermore, our model is up to 70% faster at inference than the baseline models when measured under equal conditions, while converging to better quality solutions.
Sergio Calvo Ordoñez · Jiahao Huang · Lipei Zhang · Guang Yang · Carola-Bibiane Schönlieb · ANGELICA I AVILES-RIVERO
Understanding Denoising Diffusion Probabilistic Models and their Noise Schedules via the Ornstein-Uhlenbeck Process (Poster)
The aim of this short note is to show that the Denoising Diffusion Probabilistic Model (DDPM), a non-homogeneous discrete-time Markov process, can be represented by a time-homogeneous continuous-time Markov process observed at non-uniformly sampled discrete times. Surprisingly, this continuous-time Markov process is the well-known and well-studied Ornstein-Uhlenbeck (O-U) process, which was developed in the 1930s for studying Brownian particles in harmonic potentials. We establish the formal equivalence between DDPM and the O-U process using its analytical solution. We further demonstrate that the design problem of the noise schedule for non-homogeneous DDPM is equivalent to designing observation times for the O-U process. We present several heuristic designs for observation times based on principled quantities such as auto-variance and Fisher information, and connect them to ad hoc noise schedules for DDPM. Interestingly, we show that the Fisher-information-motivated schedule corresponds exactly to the cosine schedule, which was developed without any theoretical foundation but is the current state-of-the-art noise schedule. Our numerical experiments further show its superior performance.
Javier E. Santos · Yen Ting Lin
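The equivalence is compact to state in standard forms (our choice of normalization): the O-U process $\mathrm{d}X_t = -X_t\, \mathrm{d}t + \sqrt{2}\, \mathrm{d}W_t$ has the exact solution

$$X_t \stackrel{d}{=} e^{-t} X_0 + \sqrt{1 - e^{-2t}}\, \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I),$$

which matches the DDPM marginal $x_k = \sqrt{\bar\alpha_k}\, x_0 + \sqrt{1 - \bar\alpha_k}\, \varepsilon$ whenever the observation times satisfy $\bar\alpha_k = e^{-2 t_k}$; choosing a noise schedule is therefore exactly choosing the observation times $t_k$.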
Diffusion based Zero-shot Medical Image-to-Image Translation for Cross Modality Segmentation (Poster)
Cross-modality image segmentation aims to segment the target modalities using a method designed in the source modality. Deep generative models can translate the target modality images into the source modality, thus enabling cross-modality segmentation. However, a vast body of existing cross-modality image translation methods relies on supervised learning. In this work, we aim to address the challenge of zero-shot-learning-based image translation tasks (extreme scenarios in which the target modality is unseen in the training phase). To leverage generative learning for zero-shot cross-modality image segmentation, we propose a novel unsupervised image translation method. The framework learns to translate the unseen source image to the target modality for image segmentation by leveraging the inherent statistical consistency between different modalities for diffusion guidance. Our framework captures identical cross-modality features in the statistical domain, offering diffusion guidance without relying on direct mappings between the source and target domains. This advantage allows our method to adapt to changing source domains without the need for retraining, making it highly practical when sufficient labeled source domain data is not available. The proposed framework is validated on zero-shot cross-modality image segmentation tasks through empirical comparisons with influential generative models, including adversarial-based and diffusion-based models.
zihao wang · YINGYU YANG · Yuzhou Chen · Tingting Yuan · Maxime Sermesant · Herve Delingette · Ona Wu
Harmonic Prior Self-conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design (Poster)
A significant amount of protein function requires binding small molecules, including enzymatic catalysis. As such, designing binding pockets for small molecules has several impactful applications ranging from drug synthesis to energy storage. Towards this goal, we first develop HarmonicFlow, an improved generative process over 3D protein-ligand binding structures based on our self-conditioned flow matching objective. FlowSite extends this flow model to jointly generate a protein pocket's discrete residue types and the molecule's binding 3D structure. We show that HarmonicFlow improves upon state-of-the-art generative processes for docking in simplicity, generality, and performance. Enabled by this structure model, FlowSite designs binding sites substantially better than baseline approaches and provides the first general solution for binding site design.
Hannes Stärk · Bowen Jing · Regina Barzilay · Tommi Jaakkola
Sharp analysis of learning a flow-based generative model from limited sample complexity (Poster)
We study the problem of training a flow-based generative model, parametrized by a two-layer autoencoder, to sample from a high-dimensional Gaussian mixture. We provide a sharp end-to-end analysis of the problem. First, we provide a tight closed-form characterization of the learnt generative flow when parametrized by a shallow denoising auto-encoder trained on a finite number $n$ of samples from the target distribution. Building on this analysis, we provide closed-form formulae for the distance between the means of the generated mixture and the means of the target mixture, which we show decays as $\Theta_n(\frac{1}{n})$. Finally, this rate is shown to be in fact Bayes-optimal.
Hugo Cui · Eric Vanden-Eijnden · Florent Krzakala · Lenka Zdeborová
Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective (Poster)
Generative models based on various neural network architectures, including flows, diffusion, and autoregressive models, have achieved significant success in data generation across diverse applications, but their theoretical analysis and an understanding of their limitations remain challenging. In this paper, we take a step in this direction by analysing the efficiency of sampling by these methods on a class of problems with a known probability distribution and comparing it with the sampling performance of more traditional methods such as Markov chain Monte Carlo and Langevin dynamics. We focus on a class of probability distributions widely studied in the statistical physics of disordered systems that relate to spin glasses, statistical inference, and constraint satisfaction problems. We leverage the fact that sampling via flow-based, diffusion-based, or autoregressive network methods can be equivalently mapped to the analysis of Bayes-optimal denoising of a modified probability measure. Our findings demonstrate that these methods encounter difficulties in sampling stemming from the presence of a first-order phase transition along the algorithm's denoising path. Our conclusions go both ways: we identify regions of parameters where these methods are unable to sample efficiently while that is possible using standard Monte Carlo or Langevin approaches, and we also identify regions where the opposite happens: standard approaches are inefficient while the discussed generative methods work well.
Davide Ghio · Yatin Dandi · Florent Krzakala · Lenka Zdeborová
Enhancing Diffusion-based Point Cloud Generation with Smoothness Constraint (Poster)
Diffusion models have been popular for point cloud generation tasks. Existing works utilize the forward diffusion process to convert the original point distribution into a noise distribution and then learn the reverse diffusion process to recover the point distribution from the noise distribution. However, the reverse diffusion process can produce samples with non-smooth points on the surface because it ignores the geometric properties of the point cloud. We propose alleviating this problem by incorporating a local smoothness constraint into the diffusion framework for point cloud generation. Experiments demonstrate that the proposed model can generate realistic shapes and smoother point clouds, outperforming multiple state-of-the-art methods.
Yukun Li · Liping Liu
Circumventing Concept Erasure Methods For Text-to-Image Generative Models (Poster)
Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine seven recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety.
Minh Pham · Kelly Marshall · Niv Cohen · Govind Mittal · Chinmay Hegde
Latent Diffusion for Document Generation with Sequential Decoding (Poster)
We present a new document-generation model called LaDiDa, which stands for Latent Diffusion for Document Generation with Sequential Decoding. Large language models (LLMs) can create impressive texts, but the quality of the documents degrades as the output lengthens. Over time, models struggle to maintain discourse coherence and desirable text dynamics, leading to rambling and repetitive results. This difficulty with long-range generation can often be attributed to the autoregressive training objective, which causes compounding errors over multiple steps. LaDiDa is a hierarchical model for improved long-text generation that decomposes the task at the document and sentence levels. Our method comprises document-level diffusion and sentence-level decoding, where diffusion is used to globally and non-autoregressively plan sentences within a document and decoding is used to locally and sequentially generate those sentences. Compared to autoregressive models, LaDiDa is able to achieve high textual diversity and structural cohesion in text generation.
Zihuiwen Ye · Elle Michelle Yang · Phil Blunsom
The Emergence of Reproducibility and Consistency in Diffusion Models (Poster)
In this work, we uncover a distinct and prevalent phenomenon within diffusion models, in contrast to most other generative models, which we refer to as "consistent model reproducibility". To elaborate, our extensive experiments have consistently shown that, when starting with the same initial noise input and sampling with a deterministic solver, diffusion models tend to produce nearly identical output content. This consistency holds true regardless of the choice of model architecture and training procedure. Additionally, our research has unveiled that this exceptional model reproducibility manifests in two distinct training regimes: (i) a "memorization regime," characterized by a significantly overparameterized model which attains reproducibility mainly by memorizing the training data; and (ii) a "generalization regime," in which the model is trained on an extensive dataset and its reproducibility emerges with the model's generalization capabilities.
Huijie Zhang · Jinfan Zhou · Yifu Lu · Minzhe Guo · Liyue Shen · Qing Qu
Exploring Attribute Variations in Style-based GANs using Diffusion Models (Poster)
Existing attribute editing methods treat semantic attributes as binary, resulting in a single edit per attribute. However, attributes such as eyeglasses, smiles, or hairstyles exhibit a vast range of diversity. In this work, we formulate the task of diverse attribute editing by modeling the multidimensional nature of attribute edits. This enables users to generate multiple plausible edits per attribute. We capitalize on disentangled latent spaces of pretrained GANs and train a Denoising Diffusion Probabilistic Model (DDPM) to learn the latent distribution for diverse edits. Specifically, we train the DDPM over a dataset of edit latent directions obtained by embedding image pairs with a single attribute change. This leads to latent subspaces that enable diverse attribute editing. Applying diffusion in the highly compressed latent space allows us to model rich distributions of edits within limited computational resources. Through extensive qualitative and quantitative experiments conducted across a range of datasets, we demonstrate the effectiveness of our approach for diverse attribute editing. We also showcase the results of our method applied to 3D editing of various face attributes.
Rishubh Parihar · Prasanna Balaji · Raghav Magazine · Sarthak Vora · Tejan Karmali · Varun Jampani · Venkatesh Babu R
LDM3D-VR: Latent Diffusion Model for 3D VR (Poster)
Latent diffusion models have proven to be state-of-the-art in the creation and manipulation of visual outputs. However, as far as we know, the generation of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite of diffusion models targeting virtual reality development that includes LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD based on textual prompts and the upscaling of low-resolution inputs to high-resolution RGBD, respectively. Our models are fine-tuned from existing pretrained models on datasets containing panoramic/high-resolution RGB images, depth maps and captions. Both models are evaluated in comparison to existing related methods.
Gabriela Ben-Melech Stan · Diana Wofk · Estelle Aflalo · Shao-Yen Tseng · zhipeng cai · Michael Paulitsch · VASUDEV LAL
DiffDock-Pocket: Diffusion for Pocket-Level Docking with Sidechain Flexibility (Poster)
When a small molecule binds to a protein, the 3D structure of the protein and its function change. Understanding this process, called molecular docking, can be crucial in areas such as drug design. Recent learning-based attempts have shown promising results at this task, yet lack features that traditional approaches support. In this work, we close this gap by proposing DiffDock-Pocket, a diffusion-based docking algorithm that is conditioned on a binding target to predict ligand poses only in a specific binding pocket. On top of this, our model supports receptor flexibility and predicts the positions of sidechains close to the binding site. Empirically, we improve the state-of-the-art in site-specific docking on the PDBBind benchmark. Especially when using in silico generated structures, we achieve more than twice the performance of current methods while being more than 20 times faster than other flexible approaches. Although the model was not trained for cross-docking to different structures, it yields competitive results in this task.
Michael Plainer · Marcella Toth · Simon Dobers · Hannes Stärk · Gabriele Corso · Céline Marquet · Regina Barzilay
-
|
Probing Intersectional Biases in Vision-Language Models with Counterfactual Examples
(
Poster
)
>
link
While vision-language models (VLMs) have achieved remarkable performance improvements recently, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender and race. Prior studies have primarily focused on probing such bias attributes individually, ignoring biases associated with intersections between social attributes. This could be due to the difficulty of collecting an exhaustive set of image-text pairs for various combinations of social attributes from existing datasets. To address this challenge, we employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale. Our approach utilizes Stable Diffusion with cross-attention control to produce sets of counterfactual image-text pairs that are highly similar in their depiction of a subject (e.g., a given occupation) while differing only in their depiction of intersectional social attributes (e.g., race & gender). We conduct extensive experiments using our generated dataset, which reveal the intersectional social biases present in state-of-the-art VLMs. |
Phillip Howard · Avinash Madasu · Tiep Le · Gustavo Lujan-Moreno · VASUDEV LAL 🔗 |
-
|
WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
(
Poster
)
>
link
The rapid advancement of generative models, facilitating the creation of hyper-realistic images from textual descriptions, has concurrently escalated critical societal concerns such as misinformation. Traditional fake detection mechanisms, although providing some mitigation, fall short in attributing responsibility for the malicious use of synthetic images. This paper introduces a novel approach to model fingerprinting that assigns responsibility for the generated images, thereby serving as a potential countermeasure to model misuse. Our method modifies generative models based on each user's unique digital fingerprint, imprinting a unique identifier onto the resultant content that can be traced back to the user. This approach, incorporating fine-tuning into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates near-perfect attribution accuracy with a minimal impact on output quality. We evaluate the robustness of our approach against various image post-processing manipulations typically executed by end-users. Through extensive evaluation of the Stable Diffusion models, our method presents a promising and novel avenue for accountable model distribution and responsible use. |
Changhoon Kim · Kyle Min · Maitreya Patel · Sheng Cheng · 'YZ' Yezhou Yang 🔗 |
-
|
$f$-GANs Settle Scores!
(
Poster
)
>
link
Generative adversarial networks (GANs) comprise a generator, trained to learn the underlying distribution of the desired data, and a discriminator, trained to distinguish real samples from those output by the generator. A majority of GAN literature focuses on understanding the optimality of the discriminator, typically under divergence minimization losses. In this paper, we propose a unified approach to analyzing the generator optimization through variational calculus, uncovering links to score-based diffusion models. Considering $f$-divergence-minimizing GANs, we show that the optimal generator is the one that matches the score of its output distribution with that of the data distribution. The proposed approach serves to unify score-based training and existing $f$-GAN flavors, leveraging results from normalizing flows, while also providing explanations for empirical phenomena such as the stability of non-saturating GAN losses, or the state-of-the-art performance of discriminator guidance in diffusion models.
|
Siddarth Asokan · Nishanth Shetty · Aadithya Srikanth · Chandra Seelamantula 🔗 |
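The optimality condition the abstract describes can be stated compactly. The following is a sketch of the claimed score-matching characterization, with notation assumed rather than taken from the paper ($p_g$ for the generator's output density, $p_d$ for the data density):

```latex
% Under an f-divergence-minimizing GAN objective, the optimal generator
% matches the score of its output distribution to the data score:
\nabla_x \log p_g(x) \;=\; \nabla_x \log p_d(x) \quad \text{for all } x
\quad\Longleftrightarrow\quad p_g = p_d .
```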
-
|
Latent Painter
(
Poster
)
>
link
Latent diffusers revolutionized generative AI and inspired creative art. When denoising the latent, the predicted original image at each step collectively animates the formation. However, the animation is limited by the denoising nature of the diffuser and only renders a sharpening process. This work presents Latent Painter, which uses the latent as the canvas and the diffuser's predictions as the plan to generate painting animations. Latent Painter can also transition one generated image into another, including between images from two different sets of checkpoints. |
Shih-Chieh Su 🔗 |
-
|
LoRA can Replace Time and Class Embeddings in Diffusion Probabilistic Models
(
Poster
)
>
link
We propose LoRA modules as a replacement for the time and class embeddings of the U-Net architecture for diffusion probabilistic models. Our experiments on CIFAR-10 show that a score network trained with LoRA achieves competitive FID scores while being more efficient in memory compared to a score network trained with time and class embeddings. |
Joo Young Choi · Jaesung Park · Inkyu Park · Jaewoong Cho · Albert No · Ernest Ryu 🔗 |
-
|
Importance-Guided Diffusion
(
Poster
)
>
link
Conditional generative modeling via diffusion processes has emerged as an indispensable tool for advancing the fidelity and diversity of sample generation, pushing the boundaries of applications such as image synthesis, style transfer, and adaptive content generation. We present a plug-and-play approach to conditional diffusion which integrates seamlessly with existing unconditioned diffusion architectures. Our method, derived from relative entropy coding, recasts diffusion as an auxiliary variable importance sampling procedure and is able to influence the generative process without the need for gradient information or tampering with the network in any manner. Furthermore, this approach offers a principled mechanism to both quantify and adjust the degree of conditioning, enabling precise navigation across a large spectrum of generative outputs. Experimental results indicate that this technique produces meaningful conditional outputs while maintaining a relatively minimal increase in computational burden. |
Paris Flood · Pietro Lió 🔗 |
-
|
Bridging the Gap: Addressing Discrepancies in Diffusion Model Training for Classifier-Free Guidance
(
Poster
)
>
link
Diffusion models have emerged as a pivotal advancement in generative modeling, setting new standards for the quality of generated instances. In this paper we aim to underscore a discrepancy between conventional training methods and the desired conditional sampling behavior of these models. While the prevalent classifier-free guidance technique works well, it is not without flaws. At higher values of the guidance scale parameter $w$, we often obtain out-of-distribution samples and mode collapse, whereas at lower values of $w$ we may not obtain the desired specificity. To address these challenges, we introduce an updated loss function that better aligns training objectives with sampling behavior. Experimental validation with FID scores on CIFAR-10 elucidates our method's ability to produce higher-quality samples with fewer sampling timesteps and to be more robust to the choice of guidance scale $w$. We also experiment with fine-tuning Stable Diffusion on the proposed loss, providing early evidence that large diffusion models may also benefit from this refined loss function.
|
Niket Patel · Luis Salamanca · Luis Barba 🔗 |
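For context, the classifier-free guidance rule the abstract refers to combines conditional and unconditional noise predictions with a scale $w$. A minimal sketch of standard CFG (not the authors' modified loss), with a toy stand-in model:

```python
import torch

def cfg_noise(model, x, t, cond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. Larger w = stronger conditioning (and, as the
    abstract notes, more risk of off-distribution samples and mode collapse)."""
    eps_uncond = model(x, t, cond=None)
    eps_cond = model(x, t, cond=cond)
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy stand-in for a conditional noise-prediction network.
model = lambda x, t, cond=None: x * (0.1 if cond is None else 0.2)
x = torch.randn(1, 3, 32, 32)
eps_hat = cfg_noise(model, x, t=10, cond="class_3", w=7.5)
print(eps_hat.shape)
```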
-
|
Particle Guidance: non-I.I.D. Diverse Sampling with Diffusion Models
(
Poster
)
>
link
In light of the widespread success of generative models, a significant amount of research has gone into speeding up their sampling time. However, generative models are often sampled multiple times to obtain a diverse set of outputs, incurring a cost that is orthogonal to sampling time. We tackle the question of how to improve diversity and sample efficiency by moving beyond the common assumption of independent samples. To this end, we propose particle guidance, an extension of diffusion-based generative sampling in which a joint-particle, time-evolving potential enforces diversity. We theoretically analyze the joint distribution that particle guidance generates, its implications for the choice of potential, and its connections with methods in other disciplines. Empirically, we test the framework both in conditional image generation, where we are able to increase diversity without affecting quality, and in molecular conformer generation, where we reduce the state-of-the-art median error by 13% on average. |
Gabriele Corso · Yilun Xu · Valentin De Bortoli · Regina Barzilay · Tommi Jaakkola 🔗 |
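A minimal sketch of the joint-particle idea, assuming a simple pairwise RBF repulsion potential over a batch of particles added to each reverse-step drift; the potential choice and Euler step form are illustrative, not the paper's exact formulation.

```python
import torch

def repulsion_grad(x, bandwidth=1.0):
    """Gradient of a pairwise RBF potential: pushes particles in a batch apart."""
    flat = x.flatten(1)                              # (n, d)
    diff = flat[:, None, :] - flat[None, :, :]       # (n, n, d), x_i - x_j
    sq = (diff ** 2).sum(-1)                         # (n, n) squared distances
    k = torch.exp(-sq / (2 * bandwidth ** 2))        # RBF kernel weights
    grad = (k[..., None] * diff).sum(1) / bandwidth ** 2
    return grad.view_as(x)

def guided_step(x, score, step=0.01, strength=0.1):
    """One Euler step on the score plus a diversity-inducing joint potential."""
    return x + step * (score(x) + strength * repulsion_grad(x))

score = lambda x: -x                    # toy score of a standard Gaussian
x = torch.randn(8, 3, 32, 32)           # 8 jointly evolving particles
x = guided_step(x, score)
print(x.shape)
```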
-
|
Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution
(
Poster
)
>
link
Despite their groundbreaking performance on many generative modeling tasks, diffusion models have fallen short on discrete data domains such as natural language. Crucially, standard diffusion models rely on the well-established theory of score matching, but efforts to generalize this to discrete structures have not yielded the same empirical gains. In this work, we bridge this gap by proposing score entropy, a novel discrete score matching loss that is more stable than existing methods, forms an ELBO for maximum likelihood training, and can be efficiently optimized with a denoising variant. Combined with architectural improvements, we scale to the GPT-2 language modeling experiments, achieving, for the first time, highly competitive performance with a non-autoregressive model. Comparing similarly sized architectures to the GPT-2 baseline, our score entropy discrete diffusion (SEDD) model attains comparable zero-shot perplexities despite reporting an upper bound (within $+15$ percent and sometimes outperforming the baseline), generates better-quality samples faster ($4\times$ lower generative perplexity when matching function evaluations and $16\times$ fewer function evaluations when matching generative perplexity compared to analytic sampling), and enables arbitrary infilling beyond standard autoregressive left-to-right prompting.
|
Aaron Lou · Chenlin Meng · Stefano Ermon 🔗 |
-
|
Adversarial Estimation of Topological Dimension with Harmonic Score Maps
(
Poster
)
>
link
Quantifying the number of variables needed to locally explain complex data is often the first step toward better understanding it. Existing techniques for intrinsic dimension estimation leverage statistical models to glean this information from samples within a neighborhood. However, existing methods often rely on well-chosen hyperparameters and ample data as manifold dimension and curvature increase. Leveraging insight into the fixed point of the score matching objective as the score map is regularized by its Dirichlet energy, we show that it is possible to retrieve the topological dimension of the manifold learned by the score map. We then introduce a novel method to measure the learned manifold's topological dimension (i.e., local intrinsic dimension) using adversarial attacks, thereby generating useful interpretations of the learned manifold. |
Eric Yeats · Cameron Darwin · Frank Liu · Hai Li 🔗 |
-
|
Generative Fractional Diffusion Models
(
Poster
)
>
link
We generalize the continuous time framework for score-based generative models from an underlying Brownian motion (BM) to an approximation of fractional Brownian motion (FBM). We derive a continuous reparameterization trick and the reverse time model by representing FBM as a stochastic integral over a family of Ornstein-Uhlenbeck processes to define generative fractional diffusion models (GFDM) with driving noise converging to a non-Markovian process of infinite quadratic variation. The Hurst index $H\in(0,1)$ of FBM enables control of the roughness of the distribution transforming path. To the best of our knowledge, this is the first attempt to build a generative model upon a stochastic process with infinite quadratic variation.
|
Gabriel Nobis · Marco Aversa · Maximilian Springenberg · Michael Detzel · Stefano Ermon · Shinichi Nakajima · Roderick Murray-Smith · Sebastian Lapuschkin · Christoph Knochenhauer · Luis Oala · Wojciech Samek
|
-
|
Text-Aware Diffusion Policies
(
Poster
)
>
link
Diffusion models scaled to massive datasets have demonstrated powerful unification capabilities between the language modality and pixel space, as convincingly evidenced by high-quality text-to-image synthesis that delights and astounds. In this work, we interpret agents interacting within a visual reinforcement learning setting as trainable video renderers, where the output video is simply frames stitched together across sequential timesteps. We then propose Text-Aware Diffusion Policies (TADPols), which use large-scale pretrained models, particularly text-to-image diffusion models, to train policies that are aligned with natural language text inputs. As the behavior represented within a policy naturally learns to align with the reward function used during optimization, we propose generating the reward signal for a reinforcement learning agent as the similarity between a provided text description and the frames the agent produces from its interactions. Furthermore, rendering the video produced by an agent during inference can be treated as a form of text-to-video generation, where the video has the added bonus of always being smooth and consistent with respect to the environmental specifications. Additionally, keeping the diffusion model frozen enables investigating how well a large-scale model pretrained only on static image and textual data understands temporally extended behaviors and actions. We conduct experiments on a variety of locomotion tasks across multiple subjects, and demonstrate that agents can be trained, using the unified understanding of vision and language captured within large-scale pretrained diffusion models, not only to synthesize videos that correspond with provided text, but also to learn to perform the motion itself as autonomous agents. |
Calvin Luo · Chen Sun 🔗 |
-
|
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images
(
Poster
)
>
link
We assemble a dataset of Creative Commons (CC)-licensed images and train a set of open diffusion models on it that are competitive with Stable Diffusion 2. This task presents two challenges: high-resolution CC images 1) lack the captions necessary to train text-to-image generative models, and 2) are relatively scarce (∼70 million, compared to LAION's ∼2 billion). In turn, we first describe telephoning, a type of transfer learning, which we use to produce a dataset of high-quality synthetic captions paired with curated CC images. Second, we propose a more efficient training recipe to explore this question of data scarcity. Third, we implement a variety of ML-systems optimizations that achieve ∼3X training speed-ups. We train multiple versions of Stable Diffusion 2 (SD2), each on a differently sized subset of LAION-2B, and find we can successfully train using <3% of LAION-2B. Our largest model, dubbed CommonCanvas, achieves performance comparable to SD2 on human evaluation, even though we use only a CC dataset that is <3% the size of LAION and synthetic captions for training. We release our model, data, and code at [REDACTED] |
Aaron Gokaslan · A. Feder Cooper · Jasmine Collins · Landan Seguin · Austin Jacobson · Mihir Patel · Jonathan Frankle · Cory Stephenson · Volodymyr Kuleshov 🔗 |
-
|
Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model
(
Poster
)
>
link
Latent diffusion models (LDMs) exhibit an impressive ability to produce realistic images, yet the inner workings of these models remain mysterious. Even when trained purely on images without explicit depth information, they typically output coherent pictures of 3D scenes. In this work, we investigate a basic interpretability question: does an LDM create and use an internal representation of simple scene geometry? Using linear probes, we find evidence that the internal activations of the LDM encode linear representations of both 3D depth and a salient-object/background distinction. These representations appear surprisingly early in the denoising process, well before a human can easily make sense of the noisy images. Intervention experiments further indicate that these representations play a causal role in image synthesis and may be used for simple high-level editing of the LDM's outputs. |
Yida Chen · Fernanda Viégas · Martin Wattenberg 🔗 |
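A minimal sketch of the linear-probing methodology described above, assuming precomputed internal activations `feats` and per-pixel depth targets `depth` (hypothetical tensors standing in for LDM activations and pseudo-depth labels); a single per-location linear map tests whether depth is linearly decodable.

```python
import torch
import torch.nn as nn

# Hypothetical probe data: internal LDM activations and matching depth maps.
n, c, h, w = 256, 320, 16, 16
feats = torch.randn(n, c, h, w)          # intermediate denoiser activations
depth = torch.randn(n, 1, h, w)          # pseudo-depth targets (e.g., from a
                                         # monocular depth estimator)

probe = nn.Conv2d(c, 1, kernel_size=1)   # 1x1 conv = per-location linear probe
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(probe(feats), depth)
    loss.backward()
    opt.step()

# Low probe error would suggest a linear depth representation in the activations.
print("probe MSE:", loss.item())
```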
-
|
DiffusionShield: A Watermark for Data Copyright Protection against Generative Diffusion Models
(
Poster
)
>
link
Generative Diffusion Models (GDMs) have showcased their remarkable capabilities in image learning and generation. Yet, their unrestrained use raised concerns about copyright protection, especially among artists, as it can replicate unique creative works without authorization. To address the challenges, we propose a watermark scheme, DiffusionShield, against GDMs. It protects images against infringement by encoding ownership information into an imperceptible watermark injected to the images. The watermark can be easily learned by GDMs and reproduced in their generated images. By detecting the watermark in generated data, infringement can be exposed with evidence. Benefiting from the uniformity of the watermarks and the joint optimization method, DiffusionShield ensures low distortion of the image, high detection accuracy, and the ability to embed lengthy messages. Experiments has validated DiffusionShield’s efficacy in defending against GDMs infringements and its superiority over conventional watermarking techniques. |
Yingqian Cui · Jie Ren · Han Xu · Pengfei He · Hui Liu · Lichao Sun · Yue XING · Jiliang Tang 🔗 |
-
|
Deep Networks as Denoising Algorithms: Sample-Efficient Learning of Diffusion Models in High-Dimensional Graphical Models
(
Poster
)
>
link
We investigate the efficiency of deep neural networks for approximating score functions in diffusion-based generative modeling. While existing approximation theories leverage the smoothness of score functions, they suffer from the curse of dimensionality for intrinsically high-dimensional data. This limitation is pronounced in graphical models such as Markov random fields, where the approximation efficiency of score functions remains unestablished. To address this, we note that score functions can often be well-approximated in graphical models through variational inference denoising algorithms. Furthermore, these algorithms can be efficiently represented by neural networks. We demonstrate this through examples, including Ising models, conditional Ising models, restricted Boltzmann machines, and sparse encoding models. Combined with off-the-shelf discretization error bounds for diffusion-based sampling, we provide an efficient sample complexity bound for diffusion-based generative modeling when the score function is learned by deep neural networks. |
Song Mei · Yuchen Wu 🔗 |
-
|
Drag-guided diffusion models for vehicle image generation
(
Poster
)
>
link
Denoising diffusion models trained at web-scale have revolutionized image generation. The application of these tools to engineering design holds promising potential but is currently limited by their inability to understand and adhere to concrete engineering constraints. In this paper, we take a step toward the goal of incorporating quantitative constraints into diffusion models by proposing physics-based guidance, which enables the optimization of a performance metric (as predicted by a surrogate model) during the generation process. As a proof-of-concept, we add drag guidance to Stable Diffusion, which allows this tool to generate images of novel vehicles while simultaneously minimizing their predicted drag coefficients. |
Nikos Arechiga · Frank Permenter · Binyang Song · Chenyang Yuan 🔗 |
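A minimal sketch of the physics-based guidance idea, assuming a differentiable surrogate `drag_model` (hypothetical) that predicts a drag coefficient from the current estimate; its gradient nudges each reverse step, in the spirit of classifier guidance. The toy score and surrogate are stand-ins, not the paper's models.

```python
import torch

def drag_guided_step(x, score, drag_model, step=0.01, scale=1.0):
    """One reverse-diffusion Euler step biased toward low predicted drag."""
    x = x.detach().requires_grad_(True)
    drag = drag_model(x).sum()                       # surrogate drag coefficient
    grad_drag, = torch.autograd.grad(drag, x)
    # Follow the score, and descend the surrogate's drag prediction.
    return (x + step * (score(x) - scale * grad_drag)).detach()

score = lambda x: -x                                 # toy score function
drag_model = lambda x: (x ** 2).mean(dim=(1, 2, 3))  # toy differentiable surrogate
x = torch.randn(4, 3, 64, 64)
x = drag_guided_step(x, score, drag_model)
print(x.shape)
```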
-
|
Generate Your Own Scotland: Satellite Image Generation Conditioned on Maps
(
Poster
)
>
link
Despite recent advancements in image generation, diffusion models still remain largely underexplored in Earth Observation. In this paper we show that state-of-the-art pretrained diffusion models can be conditioned on cartographic data to generate realistic satellite images. For this purpose, we provide two large datasets of paired maps and satellite views over the region of Mainland Scotland and the Central Belt. We train a ControlNet model and qualitatively evaluate the results, demonstrating that both image quality and map fidelity are possible. Additionally, we explore its use for the reconstruction of historical maps. Finally, we provide some insights on the opportunities and challenges of applying these models for remote sensing. |
Miguel Espinosa Minano · Elliot Crowley 🔗 |
-
|
DFU: scale-robust diffusion model for zero-shot super-resolution image generation
(
Poster
)
>
link
Diffusion generative models have achieved remarkable success in generating images at a fixed resolution. However, existing models have limited ability to generalize to resolutions for which training data are not available. Leveraging techniques from operator learning, we present a novel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the score operator by combining both spatial and spectral information at multiple resolutions. Comparisons of DFU to baselines demonstrate its scalability: 1) simultaneously training on multiple resolutions improves FID over training at any single fixed resolution; 2) DFU generalizes beyond its training resolutions, allowing for coherent, high-fidelity generation at higher resolutions with the same model, i.e., zero-shot super-resolution image generation; 3) we propose a fine-tuning strategy to further enhance this zero-shot super-resolution capability, leading to an FID of 11.3 at 1.66 times the maximum training resolution on FFHQ, which no other method comes close to achieving. |
Alexander Havrilla · Kevin Rojas · Wenjing Liao · Molei Tao 🔗 |
-
|
Diffusion Models without Attention
(
Poster
)
>
link
Advances in high-fidelity image generation have been spearheaded by denoising diffusion probabilistic models (DDPMs). However, considerable computational challenges remain when scaling current DDPM architectures to high resolutions, due to the use of attention in either UNet architectures or Transformer variants. To make models tractable, it is common to employ lossy compression techniques in hidden space, such as patchifying, which trade representational capacity for efficiency. We propose the Diffusion State Space Model (DiffuSSM), an architecture that replaces attention with a more efficient state space model backbone. The model avoids global compression, enabling a longer, more fine-grained image representation in the diffusion process. Our validation on ImageNet indicates superior performance in terms of FID and Inception Score at reduced total FLOP usage compared to previous diffusion models using attention. |
Nathan Yan · Jiatao Gu · Alexander Rush 🔗 |
-
|
Effective Data Augmentation With Diffusion Models
(
Poster
)
>
link
Data augmentation is one of the most prevalent tools in deep learning, underpinning many recent advances, including those in classification, generative modeling, and representation learning. The standard approach combines simple transformations like rotations and flips to generate new images from existing ones. However, these new images lack diversity along the semantic axes in which the data varies: current augmentations cannot alter high-level semantic attributes, such as the animal species present in a scene. We address this lack of diversity with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We observe an improvement in accuracy of up to 24.2% on six standardized few-shot image classification tasks, with larger gains for the more fine-grained concepts. |
Brandon Trabucco · Kyle Doherty · Max Gurinas · Russ Salakhutdinov 🔗 |
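One way to realize semantic augmentation with an off-the-shelf model, sketched below with the Hugging Face diffusers image-to-image pipeline; the prompt template, strength value, and file paths are illustrative assumptions, not the paper's exact recipe.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Off-the-shelf image-to-image editing as semantic data augmentation.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def augment(image: Image.Image, class_name: str, strength: float = 0.5):
    """Resynthesize an image conditioned on its class label; `strength`
    controls how far the edit strays from the original pixels."""
    prompt = f"a photo of a {class_name}"   # illustrative prompt template
    return pipe(prompt=prompt, image=image, strength=strength).images[0]

# Hypothetical file paths for a few-shot training image and its augmentation.
original = Image.open("train/retriever_001.jpg").convert("RGB")
augmented = augment(original, "golden retriever")
augmented.save("train/retriever_001_aug.jpg")
```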
-
|
DYffusion: A Dynamics-informed Diffusion Model for Spatiotemporal Forecasting
(
Poster
)
>
link
While diffusion models can successfully generate data and make predictions, they are predominantly designed for static images. We propose an approach for training diffusion models for probabilistic dynamics forecasting that leverages the temporal dynamics encoded in the data, directly coupling it with the diffusion steps in the network. We train a stochastic, time-conditioned interpolator and a forecaster network that mimic the forward and reverse processes of conventional diffusion models, respectively. This design choice naturally encodes multi-step and long-range forecasting capabilities, allowing for highly flexible, continuous-time sampling trajectories and the ability to trade-off performance with accelerated sampling at inference time. In addition, the dynamics-informed diffusion process imposes a strong inductive bias, allowing for improved computational efficiency compared to traditional Gaussian noise-based diffusion models. Our approach performs competitively on probabilistic evaluations for forecasting complex dynamics in sea surface temperatures, Navier-Stokes flows, and spring mesh systems. |
Salva Rühling Cachay · Bo Zhao · Hailey Joren · Rose Yu 🔗 |
-
|
ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models
(
Poster
)
>
link
The ability to understand visual concepts and to replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high-definition, realistic image generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models to learn and synthesize novel visual concepts, we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts and 33K composite text prompts. Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in ground-truth images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning the concepts and preserving compositionality that existing approaches struggle to overcome. |
Maitreya Patel · Tejas Gokhale · Chitta Baral · 'YZ' Yezhou Yang 🔗 |
-
|
Functional Flow Matching
(
Poster
)
>
link
We propose Functional Flow Matching (FFM), a function-space generative model that generalizes the recently-introduced Flow Matching model to operate directly in infinite-dimensional spaces. Our approach works by first defining a path of probability measures that interpolates between a fixed Gaussian measure and the data distribution, followed by learning a vector field on the underlying space of functions that generates this path of measures. Our method does not rely on likelihoods or simulations, making it well-suited to the function space setting. We provide both a theoretical framework for building such models and an empirical evaluation of our techniques. We demonstrate through experiments on synthetic and real-world benchmarks that our proposed FFM method outperforms several recently proposed function-space generative models. |
Gavin Kerrigan · Giosue Migliorini · Padhraic Smyth 🔗 |
-
|
Diffusion-Augmented Neural Processes
(
Poster
)
>
link
Over the last few years, Neural Processes have become a useful modelling tool in many application areas, such as healthcare and climate sciences, in which data are scarce and prediction uncertainty estimates are indispensable. However, the current state of the art in the field (AR CNPs; Bruinsma et al., 2023) presents a few issues that prevent its widespread deployment. This work proposes an alternative, diffusion-based approach to NPs which, through conditioning on noised datasets, addresses many of these limitations, whilst also exceeding SOTA performance. |
Lorenzo Bonito · James Requeima · Aliaksandra Shysheya · Richard Turner 🔗 |
-
|
Leveraging Diffusion Disentangled Representations to Mitigate Shortcuts in Underspecified Visual Tasks
(
Poster
)
>
link
Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to shortcut learning phenomena, where a model may rely on erroneous, easy-to-learn, cues while ignoring reliable ones. In this work, we propose an ensemble diversification framework exploiting the generation of synthetic counterfactuals using Diffusion Probabilistic Models (DPMs). We discover that DPMs have the inherent capability to represent multiple visual cues independently, even when they are largely correlated in the training data. We leverage this characteristic to encourage model diversity and empirically show the efficacy of the approach with respect to several diversification objectives. We show that diffusion-guided diversification can lead models to avert attention from shortcut cues, achieving ensemble diversity performance comparable to previous methods requiring additional data collection. |
Luca Scimeca · Alexander Rubinstein · Armand Nicolicioiu · Damien Teney · Yoshua Bengio 🔗 |
-
|
Score Normalization for a Faster Diffusion Exponential Integrator Sampler (DEIS)
(
Poster
)
>
link
Recently, Zhang et al. proposed the Diffusion Exponential Integrator Sampler (DEIS) for fast generation of samples from diffusion models. It leverages the semi-linear nature of the probability flow ordinary differential equation (ODE) to greatly reduce integration error and improve generation quality at low numbers of function evaluations (NFEs). Key to this approach is the score function reparameterisation, which reduces the integration error incurred from using a fixed score function estimate over each integration step. The original authors use the default parameterisation of models trained for noise prediction: multiply the score by the standard deviation of the conditional forward noising distribution. We find that although the mean absolute value of this score parameterisation is close to constant for a large portion of the reverse sampling process, it changes rapidly at the end of sampling. As a simple fix, we propose to instead reparameterise the score (at inference) by dividing it by the average absolute value of previous score estimates at that time step, collected from offline high-NFE generations. We find that our score normalisation (DEIS-SN) consistently improves FID compared to vanilla DEIS, showing an FID improvement from 6.44 to 5.57 at 10 NFEs in our CIFAR-10 experiments. We will make our code public upon publication. |
Guoxuan Xia · Duolikun Danier · Ayan Das · Stathi Fotiadis · Farhang Nabiei · Ushnish Sengupta · Alberto Bernacchia 🔗 |
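A minimal sketch of the proposed normalization, assuming per-timestep average absolute noise predictions collected offline from stored high-NFE trajectories; the data layout and toy model are illustrative, not the authors' code.

```python
import torch
from collections import defaultdict

def collect_avg_abs(model, trajectories):
    """Offline pass: average |noise prediction| per timestep over stored
    high-NFE generations. `trajectories` is a list of (x_t, t) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, t in trajectories:
        sums[int(t)] += model(x, torch.tensor(t)).abs().mean().item()
        counts[int(t)] += 1
    return {t: sums[t] / counts[t] for t in sums}

def normalized_eps(model, x, t, avg_abs):
    """Reparameterisation used at inference: divide the prediction by its
    offline average magnitude so the integrated quantity varies less."""
    return model(x, torch.tensor(t)) / avg_abs[int(t)]

# Toy usage with a stand-in noise-prediction model.
model = lambda x, t: 0.5 * x
trajs = [(torch.randn(1, 3, 32, 32), t) for t in range(10) for _ in range(4)]
avg_abs = collect_avg_abs(model, trajs)
eps = normalized_eps(model, torch.randn(1, 3, 32, 32), 3, avg_abs)
print(eps.shape)
```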
-
|
AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model
(
Poster
)
>
link
Aligning agent behaviors with diverse human preferences remains a challenging problem in reinforcement learning (RL), owing to the inherent abstractness and mutability of human preferences. To address these issues, we propose AlignDiff, a novel framework that leverages RLHF to quantify human preferences (covering abstractness) and utilizes them to guide diffusion planning for zero-shot behavior customization (covering mutability). AlignDiff can accurately match user-customized behaviors and efficiently switch from one to another. To build the framework, we first establish multi-perspective human feedback datasets, which contain comparisons for the attributes of diverse behaviors, and then train an attribute strength model to predict quantified relative strengths. After relabeling behavioral datasets with relative strengths, we train an attribute-conditioned diffusion model, which serves as a planner, with the attribute strength model as a director for preference alignment at the inference phase. We evaluate AlignDiff on various locomotion tasks and demonstrate its superior performance on preference matching, switching, and covering compared to other baselines. Its capability of completing unseen downstream tasks under human instructions also showcases promising potential for human-AI collaboration. More visualization videos are released at https://aligndiff.github.io/. |
Zibin Dong · Yifu Yuan · Jianye Hao · Fei Ni · Yao Mu · YAN ZHENG · Yujing Hu · Tangjie Lv · Changjie Fan · ZHIPENG HU 🔗 |
-
|
LC-SD: Realistic Endoscopic Image Generation with Stable Diffusion and ControlNet
(
Poster
)
>
link
Computer-assisted surgical systems provide support information to the surgeon, which can improve the execution and overall outcome of the procedure. These systems are based on deep learning models that are trained on complex and challenging-to-annotate data. Generating synthetic data can overcome these limitations, but it is necessary to reduce the domain gap between real and synthetic data. We propose a method for image-to-image translation based on a Stable Diffusion model, which generates realistic images starting from synthetic data. Compared to previous works, the proposed method is better suited for clinical application as it requires a much smaller amount of input data and allows finer control over the generation of details by introducing different variants of supporting control networks. The proposed method is applied in the context of laparoscopic cholecystectomy, using synthetic and real data from public datasets. It achieves a mean Intersection over Union of 69.76%, significantly improving the baseline results (69.76% vs. 42.21%). The proposed method for translating synthetic images into images with realistic characteristics will enable the training of deep learning methods that can generalize optimally to real-world contexts, thereby improving computer-assisted intervention guidance systems. |
Joanna Kaleta · Diego Dall'Alba · Szymon Plotka · Przemyslaw Korzeniowski 🔗 |
-
|
Faster Training of Diffusion Models and Improved Density Estimation via Parallel Score Matching
(
Poster
)
>
link
In Diffusion Probabilistic Models (DPMs), the task of modeling the score evolution via a single time-dependent neural network necessitates extended training periods and may potentially impede modeling flexibility and capacity. To counteract these challenges, we propose leveraging the independence of learning tasks at different time points inherent to DPMs. More specifically, we partition the learning task by utilizing independent networks, each dedicated to learning the evolution of scores within a specific time sub-interval. Further, inspired by residual flows, we extend this strategy to its logical conclusion by employing separate networks to independently model the score at each individual time point. As empirically demonstrated on synthetic and image datasets, our approach not only significantly accelerates the training process by introducing an additional layer of parallelization atop data parallelization, but it also enhances density estimation performance when compared to the conventional training methodology for DPMs. |
Etrit Haxholli · Marco Lorenzi 🔗 |
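A minimal sketch of the time-partitioning idea: train one network per sub-interval of diffusion time, independently (and hence in parallel). The interval count, network size, and routing below are illustrative assumptions.

```python
import torch
import torch.nn as nn

K = 4  # number of time sub-intervals, each with its own score network

nets = nn.ModuleList(
    nn.Sequential(nn.Linear(2 + 1, 64), nn.SiLU(), nn.Linear(64, 2))
    for _ in range(K)
)

def net_for(t):
    """Route a time t in [0, 1] to the network owning its sub-interval."""
    return nets[min(int(t * K), K - 1)]

def score(x, t):
    t_col = torch.full((x.shape[0], 1), float(t))
    return net_for(t)(torch.cat([x, t_col], dim=1))

# Each network's denoising score-matching loss touches only its own interval,
# so the K sub-problems can be optimized on separate workers in parallel.
x = torch.randn(16, 2)
print(score(x, 0.7).shape)
```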
-
|
Masked Multi-time Diffusion for Multi-modal Generative Modeling
(
Poster
)
>
link
Multi-modal data is ubiquitous, and models that learn a joint representation of all modalities have flourished. However, existing approaches suffer from a coherence-quality tradeoff, where generation quality comes at the expense of generative coherence across modalities, and vice versa. To overcome these limitations, we propose a novel method that uses a set of independently trained, uni-modal, deterministic autoencoders. The individual latent variables are concatenated and fed to a masked diffusion model to enable generative modeling. We also introduce a new multi-time training method to learn the conditional score network for multi-modal diffusion. Empirically, our methodology substantially outperforms competitors in both generation quality and coherence. |
Mustapha BOUNOUA · Giulio Franzese · Pietro Michiardi 🔗 |
-
|
MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design
(
Poster
)
>
link
Metal-organic frameworks (MOFs) are of immense interest in applications such as gas storage and carbon capture due to their exceptional porosity and tunable chemistry. Their modular nature has enabled the use of template-based methods to generate hypothetical MOFs by combining molecular building blocks in accordance with known network topologies. However, the ability of these methods to identify top-performing MOFs is often hindered by the limited diversity of the resulting chemical space. In this work, we propose MOFDiff: a coarse-grained (CG) diffusion model that generates CG MOF structures through a denoising diffusion process over the coordinates and identities of the building blocks. The all-atom MOF structure is then determined through a novel assembly algorithm. As the diffusion model generates 3D MOF structures by predicting scores in E(3), we employ equivariant graph neural networks that respect the permutational and roto-translational symmetries. We comprehensively evaluate our model's capability to generate valid and novel MOF structures and its effectiveness in designing outstanding MOF materials for carbon capture applications with molecular simulations. |
Xiang Fu · Tian Xie · Andrew Rosen · Tommi Jaakkola · Jake Smith 🔗 |
-
|
Successfully Applying Lottery Ticket Hypothesis to Diffusion Model
(
Poster
)
>
link
Despite the success of diffusion models, their training and inference are notoriously expensive due to the long chain of the reverse process. In parallel, the Lottery Ticket Hypothesis (LTH) claims that there exist winning tickets (i.e., a properly pruned sub-network together with the original weight initialization) that can achieve performance competitive with the original dense neural network when trained in isolation. In this work, we apply LTH to diffusion models for the first time. We empirically find subnetworks at sparsity 90%-99% without compromising performance for denoising diffusion probabilistic models on benchmarks (CIFAR-10, CIFAR-100, MNIST). Moreover, existing LTH works identify subnetworks with a uniform sparsity across different layers. We observe that the similarity between two winning tickets of a model varies from block to block: the upstream layers of two winning tickets tend to be more similar than the downstream layers. Therefore, we propose to find winning tickets with varying sparsity along the layers of the model. Experimental results demonstrate that our method can find sparser sub-models that require less memory for storage and reduce the necessary number of FLOPs. Codes are available at https://anonymous.4open.science/r/Lottery-Ticket-to-DDPM-2D79/.
|
Chao Jiang · Bo Hui · Bohan Liu · Da Yan 🔗 |
-
|
Strong generalization in diffusion models
(
Poster
)
>
link
High-quality samples generated with score-based reverse diffusion algorithms provide evidence that deep neural networks (DNNs) trained for denoising can learn high-dimensional densities, despite the curse of dimensionality. However, recent reports of memorization of the training set raise the question of whether these networks are learning the "true" continuous density of the data. Here, we show that two denoising DNNs trained on non-overlapping subsets of a dataset learn nearly the same score function, and thus the same density, with a surprisingly small number of training images. This strong generalization demonstrates an alignment of powerful inductive biases in the DNN architecture and/or training algorithm with properties of the data distribution. Our method is general and can be applied to assess generalization vs. memorization in any generative model. |
Zahra Kadkhodaie · Florentin Guth · Eero Simoncelli · Stephane Mallat 🔗 |
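A minimal sketch of the comparison described above, assuming two denoisers `f_a` and `f_b` already trained on disjoint halves of a dataset (training omitted; toy stand-ins below). Agreement of their outputs on fresh noisy inputs is the generalization signal, since the denoiser determines the learned score.

```python
import torch

# Stand-ins for two denoisers trained on non-overlapping data splits.
f_a = lambda y: 0.9 * y
f_b = lambda y: 0.9 * y + 0.001 * y.sign()

x = torch.rand(64, 3, 32, 32)            # held-out clean images
sigma = 0.5
y = x + sigma * torch.randn_like(x)      # noisy observations

# By Tweedie's formula, score(y) = (f(y) - y) / sigma**2 for an MMSE denoiser,
# so near-identical denoisers imply near-identical learned densities.
d = (f_a(y) - f_b(y)).pow(2).mean().sqrt()
print("RMS disagreement between the two denoisers:", d.item())
```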
-
|
Diffusion Models With Learned Adaptive Noise Processes
(
Poster
)
>
link
Diffusion models have gained traction as powerful algorithms for synthesizing high-quality images. Central to these algorithms is the diffusion process, which maps data to noise according to equations inspired by thermodynamics and which can significantly impact performance. In this work, we explore whether a diffusion process can be learned from data. We propose multivariate learned adaptive noise (MuLAN), a learned diffusion process that applies Gaussian noise at different rates across an image. Our method consists of three components: a multivariate noise schedule, instance-conditional diffusion, and auxiliary variables. These ensure that the learning objective is no longer invariant to the choice of noise schedule, as it was in previous works. Our work is grounded in Bayesian inference and casts the learned diffusion process as an approximate variational posterior that yields a tighter lower bound on the marginal likelihood. Empirically, MuLAN significantly improves likelihood estimation on CIFAR10 and ImageNet, and achieves 2x faster convergence to state-of-the-art performance compared to classical diffusion. |
Subham Sahoo · Aaron Gokaslan · Christopher De Sa · Volodymyr Kuleshov 🔗 |
-
|
Generalized Contrastive Divergence: Joint Training of Energy-Based Model and Diffusion Model through Inverse Reinforcement Learning
(
Poster
)
>
link
We present Generalized Contrastive Divergence (GCD), a novel objective function for training an energy-based model (EBM) and a sampler simultaneously. GCD generalizes Contrastive Divergence, a celebrated algorithm for training EBMs, by replacing the MCMC distribution with a trainable sampler, such as a diffusion model. In GCD, the joint training of the EBM and a diffusion model is formulated as a minimax problem, which reaches an equilibrium when both models converge to the data distribution. Minimax learning with GCD bears an interesting equivalence to inverse reinforcement learning, where the energy corresponds to a negative reward, the diffusion model is a policy, and the real data are expert demonstrations. We present preliminary yet promising results showing that the joint training is beneficial for both the EBM and the diffusion model. In particular, GCD learning can be employed to fine-tune a diffusion model to boost its sample quality. |
Sangwoong Yoon · Dohyun Kwon · Himchan Hwang · Yung-Kyun Noh · Frank Park 🔗 |
-
|
Diffusing More Objects for Semi-Supervised Domain Adaptation with Less Labeling
(
Poster
)
>
link
For object detection, it is possible to view the prediction of bounding boxes as a reverse diffusion process, where the random bounding boxes are iteratively refined as a denoising step, conditioned on the image using a diffusion model. We propose a stochastic accumulator function that starts each run with random bounding boxes and combines the slightly different predictions. We empirically verify that this improves detection performance. The improved detections are leveraged on unlabelled images, as weighted pseudo-labels for semi-supervised learning. We evaluate the method on a challenging out-of-domain test set. Our method brings significant improvements and is on par with human-selected pseudo-labels, while not requiring any human involvement. |
Leander van den Heuvel · Gertjan Burghouts · David Zhang · Gwenn Englebienne · Sabina van Rooij 🔗 |
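A minimal sketch of the accumulation idea: run the box-denoising process several times from fresh random boxes and combine per-run detections by confidence-weighted averaging. The correspondence step is simplified to identity here (box i is assumed to match across runs), which is an illustrative assumption, not the paper's matching procedure.

```python
import torch

def accumulate(runs):
    """Combine detections from multiple diffusion runs.
    `runs`: list of (boxes, scores) with boxes (n, 4) and scores (n,)."""
    boxes = torch.stack([b for b, _ in runs])    # (R, n, 4)
    scores = torch.stack([s for _, s in runs])   # (R, n)
    w = scores / scores.sum(0, keepdim=True)     # per-box weights over runs
    fused = (w.unsqueeze(-1) * boxes).sum(0)     # confidence-weighted boxes
    return fused, scores.mean(0)

# Toy usage: three runs, five boxes each.
runs = [(torch.rand(5, 4), torch.rand(5)) for _ in range(3)]
boxes, conf = accumulate(runs)
print(boxes.shape, conf.shape)
```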
-
|
Aligning Optimization Trajectories with Diffusion Models for Constrained Design Generation
(
Poster
)
>
link
Generative models have had a profound impact on vision and language, paving the way for a new era of multimodal generative applications. Yet, challenges persist in constrained environments, such as engineering and science, where data is limited and precision is crucial. We introduce Diffusion Optimization Models (DOM) and Trajectory Alignment (TA), a regularization technique aligning diffusion model sampling with physics-based optimization. By significantly improving performance and inference efficiency, DOM enables us to generate high-quality designs in just a few steps and guide them toward regions of high performance and manufacturability, paving the way for the widespread application of generative models in large-scale data-driven design. |
Giorgio Giannone · Akash Srivastava · Ole Winther · Faez Ahmed 🔗 |
-
|
TADA: Timestep-Aware Data Augmentation for Diffusion Models
(
Poster
)
>
link
Simply applying augmentation techniques to generative models can lead to a distribution shift problem, producing unintended augmented-like output samples. While this issue has been actively studied in generative adversarial networks (GANs), little attention has been paid to diffusion models despite their widespread use. In this work, we conduct the first comprehensive study of data augmentation for diffusion models, primarily investigating the relationship between distribution shifts and data augmentation. Our study reveals that distribution shifts in diffusion models originate exclusively from specific timestep intervals, rather than from the entire timesteps. Based on these findings, we introduce a simple yet effective data augmentation strategy that flexibly adjusts the augmentation strength depending on timesteps. Experiments demonstrate that our simple data augmentation pipeline can improve the generation quality of diffusion models, especially in data-limited settings. We expect that our data augmentation method can benefit various diffusion model designs and tasks across a wide scope of applications. |
NaHyeon Park · Kunhee Kim · Song Park · Jung-Woo Ha · Hyunjung Shim 🔗 |
-
|
Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints
(
Poster
)
>
link
Automatic layout generation is a fundamental problem in graphic design. Although recent diffusion-based models have achieved state-of-the-art FID scores, they underperform on alignment compared to earlier transformer-based models. In this work, we propose the LAyout Constraint diffusion modEl (LACE), a unified model for unconditional and conditional layout generation tasks in a continuous space. Compared with existing methods that use discrete diffusion models, continuous state space enables the incorporation of aesthetic constraint functions in training for enhanced visual quality. For conditional generation, LACE incorporates layout conditions via masked input throughout the training and testing phases. Experiment results show that LACE outperforms existing state-of-the-art baselines and produces visually plausible layouts. |
Jian Chen · Ruiyi Zhang · Yufan Zhou · Rajiv Jain · Zhiqiang Xu · Ryan Rossi · Changyou Chen 🔗 |
-
|
Manifold Diffusion Fields
(
Poster
)
>
link
We present Manifold Diffusion Fields (MDF), an approach that unlocks learning of diffusion models of data in general non-Euclidean geometries. Leveraging insights from spectral geometry analysis, we define an intrinsic coordinate system on the manifold via the eigen-functions of the Laplace-Beltrami Operator. MDF represents functions using an explicit parametrization formed by a set of multiple input-output pairs. Empirical results on multiple datasets and manifolds including challenging scientific problems like weather prediction or molecular conformation show that MDF can capture distributions of such functions with better diversity and fidelity than previous approaches. |
Ahmed Elhag · Yuyang Wang · Joshua Susskind · Miguel Angel Bautista 🔗 |
-
|
Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models
(
Poster
)
>
link
Recent endeavors in video editing have showcased promising results in single-attribute editing or style transfer tasks. However, when confronted with the complexities of multi-attribute editing scenarios, they exhibit shortcomings such as omitting intended attribute changes, modifying the wrong elements of the input video, and failing to preserve regions of the input video that should remain intact. To address this, we present a novel grounding-guided video-to-video translation framework called Ground-A-Video for multi-attribute video editing. Ground-A-Video attains temporally consistent multi-attribute editing of input videos in a training-free manner, without the aforementioned shortcomings. Central to our method is the introduction of Cross-Frame Gated Attention, which incorporates grounding information into the latent representations in a temporally consistent fashion, along with Modulated Cross-Attention and optical-flow-guided inverted-latents smoothing. Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit accuracy and frame consistency. Further results are provided on our project page (http://ground-a-video.github.io). |
Hyeonho Jeong · Jong Chul Ye 🔗 |