Timezone: »

A generative nonparametric Bayesian model for whole genomes
Alan Amin · Eli N Weinstein · Debora Marks

Wed Dec 08 04:30 PM -- 06:00 PM (PST) @ None #None

Generative probabilistic modeling of biological sequences has widespread existing and potential use across biology and biomedicine, particularly given advances in high-throughput sequencing, synthesis and editing. However, we still lack methods with nucleotide resolution that are tractable at the scale of whole genomes and that can achieve high predictive accuracy in theory and practice. In this article we propose a new generative sequence model, the Bayesian embedded autoregressive (BEAR) model, which uses a parametric autoregressive model to specify a conjugate prior over a nonparametric Bayesian Markov model. We explore, theoretically and empirically, applications of BEAR models to a variety of statistical problems including density estimation, robust parameter estimation, goodness-of-fit tests, and two-sample tests. We prove rigorous asymptotic consistency results including nonparametric posterior concentration rates. We scale inference in BEAR models to datasets containing tens of billions of nucleotides. On genomic, transcriptomic, and metagenomic sequence data we show that BEAR models provide large increases in predictive performance as compared to parametric autoregressive models, among other results. BEAR models offer a flexible and scalable framework, with theoretical guarantees, for building and critiquing generative models at the whole genome scale.

Author Information

Alan Amin (Harvard University)
Eli N Weinstein (Harvard)
Debora Marks

More from the Same Authors

  • 2021 : Bayesian Data Selection »
    Eli N Weinstein · Jeffrey Miller
  • 2021 : Bayesian Data Selection »
    Eli N Weinstein
  • 2021 Workshop: Learning Meaningful Representations of Life (LMRL) »
    Elizabeth Wood · Adji Bousso Dieng · Aleksandrina Goeva · Anshul Kundaje · Barbara Engelhardt · Chang Liu · David Van Valen · Debora Marks · Edward Boyden · Eli N Weinstein · Lorin Crawford · Mor Nitzan · Romain Lopez · Tamara Broderick · Ray Jones · Wouter Boomsma · Yixin Wang
  • 2021 : Multimodal Single-Cell Data Integration + Q&A »
    Daniel Burkhardt · Smita Krishnaswamy · Malte Luecken · Debora Marks · Angela Pisco · Bastian Rieck · Jian Tang · Alexander Tong · Fabian Theis · Guy Wolf
  • 2020 Workshop: Learning Meaningful Representations of Life (LMRL.org) »
    Elizabeth Wood · Debora Marks · Ray Jones · Adji Bousso Dieng · Alan Aspuru-Guzik · Anshul Kundaje · Barbara Engelhardt · Chang Liu · Edward Boyden · Kresten Lindorff-Larsen · Mor Nitzan · Smita Krishnaswamy · Wouter Boomsma · Yixin Wang · David Van Valen · Orr Ashenberg