A Unification of Discrete, Gaussian, and Simplicial Diffusion
Abstract
To model discrete sequences such as DNA, proteins, and language using diffusion, practitioners must choose between three major methods: diffusion in discrete space, Gaussian diffusion in Euclidean space, or diffusion on the simplex. Despite their shared goal, these models have disparate algorithms, theoretical structures, and strengths. Ideally, we could see each of these models as an instance of the same underlying framework, and practitioners could seamlessly transition between the domains to fit their applications. However, previous theories have only considered connections in special cases. Here we unify all three methods as different parameterizations of the same underlying process: the Wright-Fisher population genetics model. We recover simplicial and Gaussian diffusion as two of its large-population limits. Our theory formally connects the likelihoods and hyperparameters of these models. Finally, we relieve the practitioner of balancing model trade-offs by demonstrating that it is possible to train a single model that can perform diffusion in any of these three domains at test time. As a proof of concept, we show that we can train models on multiple domains at once that are competitive with models trained on any individual domain.