Cryo-electron microscopy (cryo-EM) is unique among tools in structural biology in its ability to image large, dynamic protein complexes. Key to this ability are image processing algorithms for heterogeneous cryo-EM reconstruction, including recent deep learning-based approaches. The state-of-the-art method cryoDRGN uses a Variational Autoencoder (VAE) framework to learn a continuous distribution of protein structures from single particle cryo-EM imaging data. While cryoDRGN is able to model complex structural motions, in practice, the Gaussian prior distribution of the VAE fails to match the aggregate approximate posterior, especially for multi-modal distributions (e.g. compositional heterogeneity). Here, we train a diffusion model as an expressive, learnable prior for cryoDRGN. We show the ability to sample from the model on two synthetic and two real datasets, where samples accurately follow the data distribution unlike samples from the VAE prior distribution. Our approach learns a high-quality generative model over molecular configurations directly from cryo-EM imaging data. We also demonstrate how the diffusion model prior can be leveraged for fast latent space traversal and interpolation between states of interest. By learning an accurate model of the data distribution, our method unlocks tools in generative modeling, sampling, and distribution analysis for heterogeneous cryo-EM ensembles.