We propose a parametric forward model for single-particle cryo-electron microscopy (cryo-EM) and employ stochastic variational inference to infer posterior distributions over its physically interpretable latent variables. Our novel cryo-EM forward model accounts for the biomolecular configuration (via the spatial coordinates of pseudo-atoms, in contrast with traditional voxelized representations), the global 3D pose, the effect of the microscope (the defocus parameter of the contrast transfer function), and noise. To capture heterogeneity, we use the anisotropic network model (ANM), which defines a Gaussian distribution in the space of atomic coordinates. In experiments on synthetic data, we show that the posteriors of the scalar component along the lowest ANM mode and of the 2D in-plane pose angle can be jointly inferred with deep neural networks. We also demonstrate Fourier frequency marching in the simulation and likelihood during training, without retraining the neural networks that characterize the variational posterior.
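As a rough illustration of the real-space part of such a forward model, the following minimal sketch displaces pseudo-atoms along a single mode, applies an in-plane rotation, projects each pseudo-atom as an isotropic 2D Gaussian, and adds Gaussian pixel noise. It is a sketch under stated assumptions, not the implementation used in this work: the function names, the grid parameters, and the toy mode vector are hypothetical, the mode is supplied by hand rather than computed from an ANM Hessian, and the contrast transfer function is omitted because it acts in Fourier space.

```python
import math
import random


def deform(coords, mode, amplitude):
    """Displace each pseudo-atom along one mode, scaled by a scalar amplitude."""
    return [tuple(c + amplitude * m for c, m in zip(atom, vec))
            for atom, vec in zip(coords, mode)]


def rotate_in_plane(coords, angle):
    """Rotate all pseudo-atoms about the z (projection) axis by `angle` radians."""
    ca, sa = math.cos(angle), math.sin(angle)
    return [(ca * x - sa * y, sa * x + ca * y, z) for x, y, z in coords]


def project(coords, n_pix, pixel_size, sigma):
    """Project along z: each pseudo-atom contributes an isotropic 2D Gaussian."""
    half = (n_pix - 1) / 2.0  # grid is centred on the origin
    image = [[0.0] * n_pix for _ in range(n_pix)]
    for x, y, _z in coords:
        for row in range(n_pix):
            gy = (row - half) * pixel_size
            for col in range(n_pix):
                gx = (col - half) * pixel_size
                r2 = (gx - x) ** 2 + (gy - y) ** 2
                image[row][col] += math.exp(-r2 / (2.0 * sigma ** 2))
    return image


def simulate(coords, mode, amplitude, angle, n_pix=16, pixel_size=1.0,
             sigma=1.0, noise_std=0.0, rng=None):
    """Hypothetical forward model: deform, rotate, project, then add noise."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    atoms = rotate_in_plane(deform(coords, mode, amplitude), angle)
    clean = project(atoms, n_pix, pixel_size, sigma)
    return [[v + rng.gauss(0.0, noise_std) for v in row] for row in clean]


# Toy example: two pseudo-atoms with a hand-written "breathing" mode.
coords = [(-2.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
mode = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
image = simulate(coords, mode, amplitude=1.0, angle=math.pi / 2, noise_std=0.1)
```

In this sketch the two latent variables that the abstract singles out, the scalar amplitude along one mode and the in-plane angle, are the `amplitude` and `angle` arguments of `simulate`; an inference scheme would treat them as unknowns given `image`.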