One of the greatest challenges facing biologists and the statisticians who work with them is representation learning: discovering and defining representations of data suited to complex, multi-scale machine learning tasks. This workshop is designed to bring together trainee and expert machine learning scientists with those at the forefront of biological research for this purpose. Our full-day workshop will advance the joint project of the CS and biology communities with the goal of "Learning Meaningful Representations of Life" (LMRL), emphasizing interpretable representation learning of structure and principle.
We will organize around the theme "From Genomes to Phenotype, and Back Again": an extension of a long-standing effort in the biological sciences to assign biochemical and cellular functions to the millions of as-yet uncharacterized gene products discovered by genome sequencing. ML methods to predict phenotype from genotype are rapidly advancing and starting to achieve widespread success. At the same time, large-scale gene synthesis and genome editing technologies have rapidly matured and become the foundation for new scientific insight as well as biomedical and industrial advances. ML-based methods have the potential to accelerate and extend these technologies' application by providing tools for solving the key problem of going "back again," from a desired phenotype to the genotype necessary to achieve that desired set of observable characteristics. We will focus on this foundational design problem and its application to areas ranging from protein engineering to phylogeny, immunology, vaccine design, and next-generation therapies.
Generative modeling, semi-supervised learning, optimal experimental design, Bayesian optimization, and many other areas of machine learning have the potential to address the phenotype-to-genotype problem, and we will bring together experts in these fields and many others.
LMRL will take place on Dec 13, 2021.
Tue 4:00 a.m. - 5:00 a.m. | All LMRL events are accessible from our Gather.Town! (GatherTown)
Tue 5:00 a.m. - 5:30 a.m. | Fritz Obermeyer (Live Talk, Zoom 2)
Tue 5:00 a.m. - 5:30 a.m. | Dagmar Kainmueller (Live Talk, Zoom 1)
Tue 5:30 a.m. - 6:00 a.m. | Mohammad Lotfollahi (Live Talk, Zoom 2)
Tue 5:30 a.m. - 6:00 a.m. | Steven Frank - The evolutionary paradox of robustness, genome overwiring, and analogies with deep learning (Live Talk, Zoom 1)
I start with the paradox of robustness, which is roughly: greater protection from errors at the system level leads to more errors at the component level. The paradox of robustness may be an important force shaping the architecture of evolutionary systems. I then turn to the question: why are genomes overwired? By which I mean that genetic regulatory networks seem to be more deeply and densely connected than one might expect. I suggest that the paradox of robustness may explain some of the observed complexity in genetic networks. Finally, I ask: what are the consequences of deeply and densely connected genetic networks? That question brings us to possible links between genetic networks, evolutionary dynamics, and learning dynamics in the deep neural networks of modern AI.
Tue 6:00 a.m. - 6:30 a.m. | Nancy Zhang - Data Denoising and Transfer Learning in Single Cell Transcriptomics (Live Talk, Zoom 1)
Cells are the basic biological units of multicellular organisms. The development of single-cell RNA sequencing (scRNA-seq) technologies has enabled us to study the diversity of cell types in tissue and to elucidate the roles of individual cell types in disease. Yet scRNA-seq data are noisy and sparse, with only a small proportion of the transcripts present in each cell represented in the final data matrix. We propose a transfer learning framework based on deep neural nets to borrow information across related single-cell data sets for denoising and expression recovery. Our goal is to leverage the expanding resources of publicly available scRNA-seq data, for example the Human Cell Atlas, which aims to be a comprehensive map of cell types in the human body. Our method is based on a Bayesian hierarchical model coupled to a deep autoencoder, the latter trained to extract transferable gene expression features across studies coming from different labs, generated by different technologies, and/or obtained from different species. Through this framework, we explore the limits of data sharing: how much can be learned across cell types, tissues, and species? How useful are data from other technologies and labs in improving the estimates from your own study? If time allows, I will also discuss the implications of such data denoising for downstream statistical inference.
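The autoencoder component of the approach described above can be sketched in plain numpy. This is a toy illustration of why a low-dimensional bottleneck denoises sparse expression data, not the speaker's method (which couples the autoencoder to a Bayesian hierarchical model and trains across public atlases); all data here are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: 200 cells x 50 genes generated from 3 latent expression
# programs, then corrupted with dropout-style zeroing to mimic scRNA-seq sparsity.
cells, genes, latent = 200, 50, 3
programs = rng.uniform(0, 1, (latent, genes))
usage = rng.uniform(0, 1, (cells, latent))
clean = usage @ programs
noisy = clean * (rng.random(clean.shape) > 0.5)  # ~50% dropout

# One-hidden-layer autoencoder: the 3-unit bottleneck forces reconstructions
# toward the shared low-rank structure, which is what does the denoising.
h = 3
W1 = rng.normal(0, 0.1, (genes, h)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (h, genes)); b2 = np.zeros(genes)

def forward(X):
    Z = np.tanh(X @ W1 + b1)      # encoder -> low-dimensional code
    return Z, Z @ W2 + b2         # linear decoder

losses, lr = [], 0.05
for _ in range(500):              # plain gradient descent on squared error
    Z, Xhat = forward(noisy)
    err = Xhat - noisy
    losses.append(float(np.mean(err ** 2)))
    gW2 = Z.T @ err / cells; gb2 = err.mean(0)
    dZ = (err @ W2.T) * (1 - Z ** 2)          # backprop through tanh
    gW1 = noisy.T @ dZ / cells; gb1 = dZ.mean(0)
    W2 -= lr * gW2; b2 -= lr * gb2; W1 -= lr * gW1; b1 -= lr * gb1

_, denoised = forward(noisy)      # reconstruction = denoised expression estimate
```

Transfer across studies would amount to sharing the encoder/decoder weights between data sets; that part is omitted here.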
Tue 6:00 a.m. - 6:30 a.m. | Matt Raybould (Live Talk, Zoom 2)
Tue 6:30 a.m. - 6:45 a.m. | 15 min Break - Check out the posters on Gather Town! (GatherTown)
Tue 6:45 a.m. - 7:15 a.m. | Frank Noe (Live Talk, Zoom 1)
Tue 6:45 a.m. - 7:15 a.m. | Jennifer Wei - Machine Learning for Chemical Sensing (Live Talk, Zoom 2)
I will present two applications of machine learning for molecular sensing: mass spectrometry and olfaction. Mass spectrometry is a method chemists use to identify unknown molecules: spectra from unknown samples are compared against existing libraries of mass spectra, and strongly matching library spectra are considered candidates for the identity of the molecule. I will discuss work on using machine learning models to predict mass spectra, which expands library coverage and improves identification. The second part turns to a more natural form of molecular sensing: olfaction. I will discuss work my team has done on predicting human odor labels for individual molecules, and some of the resulting consequences.
Tue 7:15 a.m. - 7:45 a.m. | Su-In Lee (Live Talk, Zoom 2)
Tue 7:15 a.m. - 7:45 a.m. | Lyla Atta - RNA velocity-informed embeddings for visualizing cellular trajectories (Live Talk, Zoom 1)
Single-cell transcriptomics profiling technologies enable genome-wide gene expression measurements in individual cells but can currently only provide a static snapshot of cellular transcriptional states. RNA velocity analysis can help infer cell state changes using such single-cell transcriptomics data. To interpret these cell state changes inferred from RNA velocity analysis as part of underlying cellular trajectories, current approaches rely on visualization with principal components, t-distributed stochastic neighbor embedding, and other 2D embeddings derived from the observed single-cell transcriptional states. However, these 2D embeddings can yield different representations of the underlying cellular trajectories, hindering the interpretation of cell state changes. We developed VeloViz to create RNA velocity-informed 2D and 3D embeddings from single-cell transcriptomics data. Using both real and simulated data, we demonstrate that VeloViz embeddings are able to capture underlying cellular trajectories across diverse trajectory topologies, even when intermediate cell states may be missing. By considering the predicted future transcriptional states from RNA velocity analysis, VeloViz can help visualize a more reliable representation of underlying cellular trajectories.
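The core idea above — let each cell's predicted future state, not only its current state, decide which cells become neighbors in the embedding graph — can be illustrated on a toy trajectory. This is a simplified sketch, not the VeloViz implementation; the data and the nearest-successor rule are illustrative.

```python
import numpy as np

# Toy trajectory: 30 cells along a line, with velocity estimates pointing "forward".
n = 30
X = np.stack([np.linspace(0.0, 1.0, n), np.zeros(n)], axis=1)  # current states
V = np.tile([1.0 / n, 0.0], (n, 1))                            # RNA-velocity estimates

# Velocity-informed rule: project each cell forward along its velocity,
# then connect it to the observed cell closest to that projected future state.
future = X + V

def nearest_successor(i):
    d = np.linalg.norm(X - future[i], axis=1)
    d[i] = np.inf                  # a cell is not its own successor
    return int(np.argmin(d))

edges = [(i, nearest_successor(i)) for i in range(n)]
# The resulting directed graph can then be laid out in 2D/3D (e.g. with a
# force-directed layout) to give a velocity-informed embedding.
```

On this toy data each cell links to the next cell along the trajectory, which is exactly the ordering a velocity-unaware embedding can scramble when intermediate states are sparse.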
Tue 7:45 a.m. - 8:15 a.m. | Jean-Philippe Vert - Deep learning for DNA and proteins: equivariance and alignment (Live Talk, Zoom 1)
Deep learning and language models are increasingly used to model DNA and protein sequences. While many models and tasks are inspired by and borrowed from the field of natural language processing, biological sequences have specificities that deserve attention. In this talk I will discuss two such specificities: 1) the inherent symmetry in double-stranded DNA sequences due to reverse-complement pairing, which calls for equivariant architectures, and 2) the fact that sequence alignment is a natural way to compare evolutionarily related sequences.
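The reverse-complement symmetry in point 1 can be made concrete with a motif scan: scoring the reverse-complement strand with a motif is the same as scoring the original strand with the reverse-complemented motif and reading the scores right-to-left. A toy sketch (the weight values are arbitrary, not from the talk):

```python
import numpy as np

BASES = "ACGT"
COMP = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(seq):
    return "".join(COMP[b] for b in reversed(seq))

def scan(seq, pwm):
    # Slide a position-weight matrix (rows A,C,G,T; one column per motif
    # position) along the sequence and score every window.
    idx = {b: i for i, b in enumerate(BASES)}
    k = pwm.shape[1]
    return np.array([sum(pwm[idx[seq[i + j]], j] for j in range(k))
                     for i in range(len(seq) - k + 1)])

def rc_pwm(pwm):
    # Reverse complement of a PWM: complementing a base swaps A<->T and C<->G,
    # i.e. reverses the row order; reading the other strand reverses positions.
    return pwm[::-1, ::-1]

pwm = np.array([[ 0.8, -1.0,  0.1,  0.3],   # arbitrary illustrative weights
                [-0.2,  0.5, -0.7,  0.2],
                [ 0.0,  0.9,  0.4, -0.5],
                [-0.6, -0.3,  0.2,  0.7]])
seq = "ACGTTGCAGGTACC"
# Equivariance: both ways of "looking at the other strand" agree.
same = np.allclose(scan(revcomp(seq), pwm), scan(seq, rc_pwm(pwm))[::-1])
```

An RC-equivariant network generalizes this identity from a single scan to every layer, so the model's output is guaranteed to transform predictably under strand flipping.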
Tue 7:45 a.m. - 8:15 a.m. | Kristin Branson (Live Talk, Zoom 2)
Tue 8:15 a.m. - 8:30 a.m. | 15 min Break - Check out the posters on Gather Town! (GatherTown)
Tue 8:30 a.m. - 9:00 a.m. | Milo Lin - Distilling generalizable rules from data using Essence Neural Networks (Live Talk, Zoom 1)
Human reasoning can distill principles from observed patterns and generalize them to explain and solve novel problems, as exemplified in the success of scientific theories. The patterns in biological data are often complex and high dimensional, suggesting that machine learning could play a vital role in distilling collective rules from patterns that may be challenging for human reasoning. However, the most powerful artificial intelligence systems are currently limited in interpretability and symbolic reasoning ability. Recently, we developed essence neural networks (ENNs), which train to do general supervised learning tasks without requiring gradient optimization, and showed that ENNs are intrinsically interpretable, can generalize out-of-distribution, and perform symbolic learning on sparse data. Here, I discuss our current progress in automatically translating the weights of an ENN into concise, executable computer code for general symbolic tasks, an implementation of data-based automatic programming which we call deep distilling. The distilled code, which can contain loops, nested logical statements, and useful intermediate variables, is equivalent to the ENN but is generally orders of magnitude more compact and human-comprehensible. Because the code is distilled from a general-purpose neural network rather than constructed by searching through libraries of logical functions, deep distilling is flexible in terms of problem domain and size. On a diverse set of problems involving arithmetic, computer vision, and optimization, we show that deep distilling generates concise code that generalizes out-of-distribution to solve problems orders of magnitude larger and more complex than the training data. For problems with a known ground-truth rule set, including cellular automata which encode a type of sequence-to-function mapping, deep distilling discovers the rule set exactly with scalable guarantees. For problems that are ambiguous or computationally intractable, the distilled rules are similar to existing human-derived algorithms and perform on par or better. Our approach demonstrates that unassisted machine intelligence can build generalizable and intuitive rules explaining patterns in large datasets that would otherwise overwhelm human detection and reasoning.
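The cellular-automata test case mentioned above gives a feel for what a compact, exactly recoverable rule set looks like: an elementary cellular automaton's entire ground-truth update is a lookup into a single 8-bit rule number. This is a generic illustration of such a rule, not output produced by deep distilling.

```python
def ca_step(cells, rule=30):
    # One synchronous update of an elementary cellular automaton with
    # wraparound: each cell's next value is bit v of `rule`, where v encodes
    # the (left, center, right) neighborhood as a 3-bit number.
    n = len(cells)
    return [(rule >> (4 * cells[(i - 1) % n] + 2 * cells[i] + cells[(i + 1) % n])) & 1
            for i in range(n)]
```

Recovering the eight output bits of `rule` from observed state transitions is exactly the kind of discrete sequence-to-function mapping for which the abstract reports exact rule recovery.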
Tue 8:30 a.m. - 9:00 a.m. | Georg Seelig - Machine learning-guided design of functional DNA, RNA and protein sequences (Live Talk, Zoom 2)
Tue 9:00 a.m. - 9:30 a.m. | Lacramioara Bintu - High-throughput discovery and characterization of human transcriptional repressor and activator domains (Live Talk, Zoom 1)
Human gene expression is regulated by thousands of proteins that can activate or repress transcription. To predict and control gene expression, we need to know where in the protein their effector domains are, and how strongly they activate or repress. To systematically measure the function of transcriptional effector domains in human cells, we developed a high-throughput assay in which pooled libraries of thousands of domains are recruited individually to a reporter gene. Cells are then separated by reporter expression level, and the library of protein domains is sequenced to determine the frequency of each domain in silenced versus active cell populations. We used this method to: 1) quantify the activation, silencing, and epigenetic memory capability of all nuclear protein domains annotated in Pfam, including the KRAB family of >300 domains. We find that while evolutionarily young KRABs are strong repressors, some of the old KRABs are activators. 2) characterize the amino acids responsible for effector function via deep mutational scanning. We applied it to the KRAB used in CRISPRi to map the co-repressor binding surface and identify substitutions that improve stability, silencing, and epigenetic memory. 3) discover novel functional domains in unannotated regions of large transcription factors, including repressors as short as 10 amino acids. Together, these results provide a resource of 600 human proteins containing effectors, and demonstrate a scalable strategy for assigning functions to protein domains.
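The readout described above — comparing each domain's frequency in the silenced versus active sorted populations — reduces to a per-domain log-ratio of normalized read counts. A toy sketch with invented domain names and counts (not the authors' analysis pipeline):

```python
import numpy as np

# Hypothetical sequencing read counts for three library members in each sorted bin.
domains  = ["repressor-like", "activator-like", "neutral"]
silenced = np.array([500.,  20., 300.])   # reads from the reporter-OFF population
active   = np.array([ 50., 400., 310.])   # reads from the reporter-ON population

# Normalize to within-bin frequencies (with a pseudocount), then take a log2
# ratio: positive => enriched among silenced cells, i.e. repressor activity.
f_sil = (silenced + 1) / (silenced + 1).sum()
f_act = (active + 1) / (active + 1).sum()
score = np.log2(f_sil / f_act)
```

Scaling the same arithmetic to thousands of domains per experiment is what makes the assay high-throughput.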
Tue 9:00 a.m. - 9:30 a.m. | Jingshu Wang - Model-based trajectory analysis for Single-Cell RNA Sequencing using deep learning with a mixture prior (Live Talk, Zoom 2)
Tue 9:30 a.m. - 10:00 a.m. | Qingyuan Zhao (Live Talk, Zoom 1)
Tue 9:30 a.m. - 10:00 a.m. | Jackson Loper - Latent representations reveal that stationary covariances are always secretly linear (Live Talk, Zoom 2)
We recently found that any continuous covariance for time-series data, no matter how intricate, can be approximated arbitrarily well in terms of a well-behaved parametric family of linear projections of linear stochastic dynamical systems. This family makes efficient exact inference a breeze, even for millions of time points. Applied to ATAC-seq data, this machinery infers smooth representations that encode how chromatin accessibility varies (1) along the one-dimensional topology of each chromosome and (2) throughout the diversity of cells.
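A small, well-known instance of this correspondence: the stationary exponential (Matérn-1/2) covariance is exactly the covariance of a linear stochastic dynamical system, the Ornstein-Uhlenbeck process, so exact Gaussian-process inference can run in linear time via a Kalman filter and smoother instead of a cubic-time matrix solve. The sketch below assumes that specific kernel; it is not the speaker's general construction.

```python
import numpy as np

def gp_direct(t, y, ell, sig2, noise):
    # O(n^3) exact GP posterior mean with the exponential (Matern-1/2) kernel.
    K = sig2 * np.exp(-np.abs(t[:, None] - t[None, :]) / ell)
    return K @ np.linalg.solve(K + noise * np.eye(len(t)), y)

def gp_state_space(t, y, ell, sig2, noise):
    # O(n) Kalman filter + RTS smoother on the equivalent OU state-space model:
    # x_{k+1} = a_k x_k + w_k,  a_k = exp(-dt_k/ell),  Var[w_k] = sig2*(1 - a_k^2).
    n = len(t)
    mf, Pf = np.zeros(n), np.zeros(n)
    m, P = 0.0, sig2                      # stationary prior at t[0]
    for k in range(n):
        gain = P / (P + noise)            # measurement update
        m, P = m + gain * (y[k] - m), (1 - gain) * P
        mf[k], Pf[k] = m, P
        if k + 1 < n:                     # time update to t[k+1]
            a = np.exp(-(t[k + 1] - t[k]) / ell)
            m, P = a * m, a * a * P + sig2 * (1 - a * a)
    ms = mf.copy()
    for k in range(n - 2, -1, -1):        # backward (RTS) smoothing pass
        a = np.exp(-(t[k + 1] - t[k]) / ell)
        Ppred = a * a * Pf[k] + sig2 * (1 - a * a)
        G = Pf[k] * a / Ppred
        ms[k] = mf[k] + G * (ms[k + 1] - a * mf[k])
    return ms

# Both routes compute the same posterior mean; only the cost differs.
t = np.array([0.0, 0.3, 0.9, 1.0, 2.5])
y = np.array([1.0, 0.5, -0.2, 0.1, 0.8])
```

The general result quoted in the abstract extends this trick beyond the exponential kernel by approximating arbitrary continuous stationary covariances with (projections of) higher-dimensional linear systems.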
Tue 10:00 a.m. - 10:15 a.m. | 15 min Break - Check out the posters on Gather Town! (GatherTown)
Tue 10:15 a.m. - 10:45 a.m. | Tatyana Sharpee (Live Talk, Zoom 2)
Tue 10:15 a.m. - 10:45 a.m. | Brian Trippe (Live Talk, Zoom 1)
Tue 10:45 a.m. - 11:15 a.m. | Jian Tang (Live Talk, Zoom 1)
Tue 10:45 a.m. - 11:15 a.m. | Antonio Moretti (Live Talk, Zoom 2)
Tue 11:15 a.m. - 11:45 a.m. | Mackenzie Mathis (Live Talk, Zoom 2)
Tue 11:15 a.m. - 11:45 a.m. | Žiga Avsec (Live Talk, Zoom 1)
Tue 11:45 a.m. - 1:15 p.m. | Poster Session in Gather Town! (Poster Session)
Tue 11:45 a.m. - 12:00 p.m. | 15 min Break - Check out the posters on Gather Town! (GatherTown)
Tue 11:45 a.m. - 1:05 p.m. | Panel: How do we define Meaningful Research in ML/Bio? (Discussion Panel, Zoom 1)
Tue 12:00 p.m. - 12:30 p.m. | Bianca Dumitrascu - Beyond multimodality in genomics (Live Talk, Zoom 2)