Workshop
AI for Science: from Theory to Practice
Yuanqi Du · Max Welling · Yoshua Bengio · Marinka Zitnik · Carla Gomes · Jure Leskovec · Maria Brbic · Wenhao Gao · Kexin Huang · Ziming Liu · Rocío Mercado · Miles Cranmer · Shengchao Liu · Lijing Wang
Hall C2 (level 1, gate 9, south of food court)
AI is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain new insights that might not have been possible using traditional scientific methods alone. It has solved scientific challenges that were unimaginable before, e.g., predicting 3D protein structures, simulating molecular systems, forecasting global climate, and discovering new scientific laws. Despite this promise, several critical gaps stifle algorithmic and scientific innovation in "AI for Science," and the overarching goal of this workshop is to grow AI for Science by closing these gaps:

* Gap 1: Science of science. The principles of scientific methods have remained unchanged since the 17th century. How AI can facilitate the practice of scientific discovery itself often remains undiscussed. For example, instead of the numerous hypothesis-experiment cycles to make sense of a scientific phenomenon, can AI reason and output natural laws directly?
* Gap 2: Limited exploration at the intersections of multiple disciplines. Solutions to grand challenges stretch across various disciplines. For example, protein structure prediction requires collaboration across physics, chemistry, and biology, and single-cell imaging of whole tumors can be approached by cosmology algorithms that connect cells as stars.
* Gap 3: Unified ecosystems of datasets, models, and scientific hypotheses. Comprehensive ecosystems and engagements of the research community, e.g., accumulation of datasets, open-source platforms, and benchmarks, are needed to reliably evaluate AI tools and integrate them into scientific workflows and instruments so that they can contribute to scientific understanding or acquire it autonomously. The workshop will emphasize this indispensable ingredient to the success of AI for Science and engage in discussions around it.
* Gap 4: Responsible use and development of AI for science. Interest in AI across scientific disciplines has grown, but very few AI models have progressed to routine use in practice. We plan to present a roadmap and guidelines for accelerating the translation of AI in science. To be successful, translation will require a team of engaged stakeholders and a systematic process from beginning (problem formulation) to end (widespread deployment).
* Gap 5: Lack of educational resources. A critical element to increase the adoption of AI for scientific discovery across disciplines is to create accessible education materials and AI-lab protocols for both AI researchers and scientists with different areas of expertise, seniority, and level of interest.
* Gap 6: Unrealistic methodological assumptions or directions. While AI researchers strive for methodological advances, they can make unrealistic assumptions that can limit the applicability of new algorithms, their adoption in real-world settings, and transition into implementation (e.g., at a particle accelerator, genome sequencing lab, or quantum chemistry lab). For example, while state-of-the-art molecule generation AI models perform well on benchmarks, they often generate molecules that can't be synthesized in a lab.
Schedule
Sat 6:15 a.m. - 6:25 a.m. | Opening Remarks
Sat 6:25 a.m. - 6:55 a.m. | Invited Talk (Steven Brunton)
Sat 6:55 a.m. - 7:25 a.m. | Invited Talk (Kyle Cranmer)
Sat 7:25 a.m. - 7:55 a.m. | Invited Talk (Fabian Theis)
Sat 7:55 a.m. - 8:05 a.m. | Coffee Break
Sat 8:05 a.m. - 8:25 a.m. | Open Catalyst Project Introduction (Invited Talk)
Sat 8:25 a.m. - 8:35 a.m. | Open Catalyst Winner Talk (Invited Talk)
Sat 8:35 a.m. - 8:45 a.m. | Open Catalyst Runner-up Talk (Invited Talk)
Sat 8:45 a.m. - 8:50 a.m. | Open Catalyst Spotlight Talk (Invited Talk)
Sat 8:50 a.m. - 9:10 a.m. | Open Catalyst Discussion (Q&A)
Sat 9:10 a.m. - 10:00 a.m. | Poster Session A
Sat 10:00 a.m. - 10:30 a.m. | Lunch Break
Sat 10:30 a.m. - 11:00 a.m. | Invited Talk (Sara Beery)
Sat 11:00 a.m. - 12:00 p.m. | Panel: Using AI to Accelerate Scientific Discovery
Sat 12:00 p.m. - 12:25 p.m. | Contributed Talk
Sat 12:25 p.m. - 12:35 p.m. | Coffee Break
Sat 12:35 p.m. - 1:05 p.m. | Invited Talk (Sherrie Wang)
Sat 1:05 p.m. - 1:35 p.m. | Invited Talk (Su-In Lee)
Sat 1:35 p.m. - 2:05 p.m. | Invited Talk (Alán Aspuru-Guzik)
Sat 2:05 p.m. - 2:30 p.m. | Contributed Talk
Sat 2:30 p.m. - 2:35 p.m. | Closing Remarks
Sat 2:35 p.m. - 3:30 p.m. | Poster Session B

Gradient Estimation For Exactly-$k$ Constraints (Poster)
The exactly-$k$ constraint is ubiquitous in machine learning and scientific applications, such as ensuring that the sum of electric charges in a neutral atom is zero. However, enforcing such constraints in machine learning models while allowing differentiable learning is challenging. In this work, we aim to provide a "cookbook" for seamlessly incorporating exactly-$k$ constraints into machine learning models by extending a recent gradient estimator from Bernoulli variables to Gaussian and Poisson variables, utilizing constraint probabilities. We show the effectiveness of our proposed gradient estimators in synthetic experiments, and further demonstrate the practical utility of our approach by training neural networks to predict partial charges for metal-organic frameworks, aiding virtual screening in chemistry. Our proposed method not only enhances the capability of learning models but also expands their applicability to a wider range of scientific domains where satisfaction of constraints is crucial.
Ruoyan Li · Dipti Ranjan Sahu · Guy Van den Broeck · Zhe Zeng
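The paper's estimator (which extends a Bernoulli gradient estimator to Gaussian and Poisson variables) is not reproduced here. As a minimal, generic illustration of what enforcing an exactly-$k$ constraint looks like at sampling time, here is a Gumbel-top-$k$ sketch; this is a standard trick, not the authors' method, and all names are hypothetical:

```python
import numpy as np

def sample_exactly_k(probs, k, rng):
    """Sample a binary vector with exactly k ones via Gumbel-top-k.

    Perturb the Bernoulli logits with Gumbel noise and keep the k
    largest; the constraint then holds by construction. In practice a
    straight-through or relaxed gradient would be attached to this step.
    """
    logits = np.log(probs) - np.log1p(-probs)   # Bernoulli logits
    perturbed = logits + rng.gumbel(size=probs.shape)
    topk = np.argsort(perturbed)[-k:]           # indices of k largest
    sample = np.zeros_like(probs)
    sample[topk] = 1.0
    return sample

rng = np.random.default_rng(0)
probs = np.array([0.9, 0.1, 0.5, 0.7, 0.2])
s = sample_exactly_k(probs, k=2, rng=rng)
assert s.sum() == 2   # exactly-k constraint satisfied by construction
```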

Learning Expert-Interpretable Programs for Myocardial Infarction Localization (Poster)
We study how to learn accurate and interpretable models for assisted clinical diagnostics. We focus on myocardial infarction (heart attack) localization from electrocardiogram (ECG) signals, which is known to have a complex mapping that is challenging even for expert cardiologists to understand. Our approach leverages recent advances in learning neurosymbolic models, and yields inherently expert-interpretable programs as compositions of ECG features and learned temporal filters. We evaluate our method on a set of 21,844 ECG recordings, to localize myocardial infarction at different levels of granularity. Results demonstrate that our model performs comparably to conventional black-box baselines, but with a much simpler and more interpretable structure.
Joshua Flashner · Jennifer J Sun · David Ouyang · Yisong Yue

Predicting the Initial Conditions of the Universe using a Deterministic Neural Network (Poster)
Finding the initial conditions that led to the current state of the universe is challenging because it involves searching over an intractable input space of initial conditions, along with modeling their evolution via tools such as N-body simulations which are computationally expensive. Recently, deep learning has emerged as a surrogate for N-body simulations by directly learning the mapping between the linear input of an N-body simulation and the final nonlinear output from the simulation, significantly accelerating the forward modeling. However, this still does not reduce the search space for initial conditions. In this work, we pioneer the use of a deterministic convolutional neural network for learning the reverse mapping and show that it accurately recovers the initial linear displacement field over a wide range of scales ($<1$-$2$% error up to nearly $k \simeq 0.8$ - $0.9 \text{ Mpc}^{-1}h$), despite the one-to-many mapping of the inverse problem (due to the divergent backward trajectories at smaller scales). Specifically, we train a V-Net architecture, which outputs the linear displacement of an N-body simulation, given the nonlinear displacement at redshift $z=0$ and the cosmological parameters. The results of our method suggest that a simple deterministic neural network is sufficient for accurately approximating the initial linear states, potentially obviating the need for the more complex and computationally demanding backward modeling methods that were recently proposed.
Vaibhav Jindal · Albert Liang · Aarti Singh · Shirley Ho · Drew Jamieson

Holistic chemical evaluation reveals pitfalls in reaction prediction models (Oral)
The prediction of chemical reactions has gained significant interest within the machine learning community in recent years, owing to its complexity and crucial applications in chemistry. However, model evaluation for this task has been mostly limited to simple metrics like top-k accuracy, which obfuscates fine details of a model's limitations. Inspired by progress in other fields, we propose a new assessment scheme that builds on top of current approaches, steering towards a more holistic evaluation. We introduce the following key components for this goal: ChORISOv1, a curated dataset along with multiple tailored splits to recreate chemically relevant scenarios, and a collection of metrics that provide a holistic view of a model's advantages and limitations. Application of this method to state-of-the-art models reveals important differences on sensitive fronts, especially stereoselectivity and chemical out-of-distribution generalization. Our work paves the way towards robust prediction models that can ultimately accelerate chemical discovery.
Victor Sabanza Gil · Andres M Bran · Malte Franke · Jeremy Luterbacher · Philippe Schwaller

Towards LLMs as Operational Copilots for Fusion Reactors (Poster)
The tokamak is one of the most promising approaches for achieving nuclear fusion as an energy source. As such, many tokamaks have been built with rich experimental histories and datasets. While the quantitative data generated by tokamaks is invaluable, tokamaks also generate another, often underutilized data stream: text logs written by experimental operators. In this work, we leverage these extensive text logs by employing Retrieval-Augmented Generation (RAG) with state-of-the-art large language models (LLMs) to create a prototype "copilot". Instances of this copilot were created using text logs from the fusion experiments DIII-D and Alcator C-Mod and deployed for researchers to use. In this paper, we report on the datasets and methodology used to create this "copilot", along with its performance on three use cases: 1) semantic search of experiments, 2) assisting with device-specific operations, and 3) answering general tokamak questions. Although we found via a survey of researchers that for general tokamak operations questions RAG doesn't offer a clear advantage over the base GPT-4 model, in the first two use cases, we observe clear advantages that RAG offers over base LLMs and simple keyword search.
Viraj Mehta · Joseph Abbate · Allen Wang · Andrew Rothstein · Ian Char · Jeff Schneider · Egemen Kolemen · Cristina Rea · Darren Garnier
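For readers unfamiliar with RAG, the retrieval step can be sketched with a toy TF-IDF ranker over operator-log text. The log entries and query below are invented for illustration; a real copilot like the one described would use dense embeddings and an LLM to generate the final answer:

```python
import math
from collections import Counter

# Hypothetical operator log entries (not real DIII-D / C-Mod data).
logs = [
    "shot 181234: plasma disrupted after locked mode at 2.1s",
    "shot 181235: good H-mode, ECH power ramped to 3MW",
    "calibrated interferometer before the morning run",
]

def tfidf_vectors(docs):
    """Bag-of-words TF-IDF vectors as {token: weight} dicts."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(docs)
    vecs = [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in tokenized
    ]
    return vecs, df, n

def retrieve(query, docs):
    """Return the index of the document most similar to the query."""
    vecs, df, n = tfidf_vectors(docs)
    q = {t: math.log(n / df[t]) for t in query.lower().split() if t in df}
    def cos(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    return max(range(len(docs)), key=lambda i: cos(q, vecs[i]))

best = retrieve("why was there a locked mode", logs)
# The retrieved log entry would then be passed to an LLM as context.
```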

On Modelability and Generalizability: Are Machine Learning Models for Drug Synergy Exploiting Artefacts and Biases in Available Data? (Poster)
Synergy models are useful tools for exploring drug combinatorial search space and identifying promising sub-spaces for in vitro/vivo experiments. Here, we report that distributional biases in the training-validation-test sets used for predictive modeling of drug synergy can explain much of the variability observed in model performances (up to $0.22$ $\Delta$AUPRC). We built 145 classification models spanning 4,577 unique drugs and 75,276 pair-wise drug combinations extracted from DrugComb, and examined spurious correlations in both the input feature and output label spaces. We posit that some synergy datasets are easier to model than others due to factors such as synergy spread, class separation, chemical structural diversity, physicochemical diversity, combinatorial tests per drug, and combinatorial label entropy. We simulate distribution shifts for these dataset attributes and report that the drug-wise homogeneity of combinatorial labels most influences modelability ($0.16\pm0.06$ $\Delta$AUPRC). Our findings imply that seemingly high-performing drug synergy models may not generalize well to broader medicinal space. We caution that the synergy modeling community's efforts may be better expended in examining data-specific artefacts and biases rigorously prior to model building.
Arushi Gandhi · Andreas Bender · Ian Stott

LenSiam: Self-Supervised Learning on Strong Gravitational Lens Images (Poster)
Self-supervised learning has been known for learning good representations from data without the need for annotated labels. We explore the simple siamese (SimSiam) architecture for representation learning on strong gravitational lens images. Commonly used image augmentations tend to change lens properties; for example, zoom-in would affect the Einstein radius. To create image pairs representing the same underlying lens model, we introduce a lens augmentation method to preserve lens properties by fixing the lens model while varying the source galaxies. Our research demonstrates this lens augmentation works well with SimSiam for learning the lens image representation without labels, so we name it LenSiam. We also show that a pre-trained LenSiam model can benefit downstream tasks. We plan to open-source our code and datasets.
Po-Wen Chang · Kuan-Wei Huang · Joshua Fagin · James Chan · Joshua Yao-Yu Lin

Surrogate Modeling for Computationally Expensive Simulations of Supernovae in High-Resolution Galaxy Simulations (Poster)
Some stars are known to explode at the end of their lives, called supernovae (SNe). SNe release a substantial amount of matter and energy to the interstellar medium, resulting in significant feedback to star formation and gas dynamics in a galaxy. While such feedback has a crucial role in galaxy formation and evolution, in simulations of galaxy formation, it has only been implemented using simple sub-grid models instead of numerically solving the evolution of gas elements around SNe in detail due to a lack of resolution. We develop a method combining machine learning and Gibbs sampling to predict how a supernova (SN) affects the surrounding gas. The fidelity of our model in the thermal energy and momentum distribution outperforms the low-resolution SN simulations. Our method can replace the SN sub-grid models and help properly simulate un-resolved SN feedback in galaxy formation simulations. We find that employing our new approach reduces the necessary computational cost to $\sim$ 1 percent compared to directly resolving SN feedback.
Keiya Hirashima · Kana Moriwaki · Michiko Fujii · Yutaka Hirai · Takayuki Saitoh · Junichiro Makino · Shirley Ho

Seismic hazard analysis with a Factorized Fourier Neural Operator (F-FNO) surrogate model enhanced by transfer learning (Poster)
Seismic hazard analyses in the area of a nuclear installation must account for a large number of uncertainties, including limited geological knowledge. It is known that some geological features can create site-effects that considerably amplify ground motion. Combining the accuracy of physics-based simulations with the expressivity of deep neural networks can help quantify the influence of geological heterogeneities on surface ground motion. This work demonstrates the use of a Factorized Fourier Neural Operator (F-FNO) that learns the relationship between 3D heterogeneous geologies and time-dependent surface wavefields. The F-FNO was pretrained on the generic HEMEW-3D database with 30,000 samples. Then, a smaller database was built specifically for the region of the Le Teil earthquake (South-Eastern France) and the F-FNO was further trained with only 250 specific samples. Transfer learning improved the prediction error by 22%. As quantified by the Goodness-Of-Fit (GOF) criteria, 90% of predictions had an excellent phase GOF (62% for the envelope GOF). Although the intensity measures of surface ground motion were, on average, slightly underestimated by the FNO, considering a set of heterogeneous geologies always led to ground motion intensities larger than those obtained from a single homogeneous geology. These results suggest that neural operators are an efficient tool to quantify the range of ground motions a nuclear installation could face in the presence of geological uncertainties.
Fanny Lehmann · Filippo Gatti · Michaël Bertin · Didier Clouteau

BENO: Boundary-embedded Neural Operators for Elliptic PDEs (Oral)
Elliptic partial differential equations (PDEs) are a major class of time-independent PDEs that play a key role in many scientific and engineering domains such as fluid dynamics, plasma physics, and solid mechanics. Recently, neural operators have emerged as a promising technique to solve elliptic PDEs more efficiently by directly mapping the input to solutions. However, existing networks typically neglect complex geometries and inhomogeneous boundary values present in the real world. Here we introduce Boundary-Embedded Neural Operators (BENO), a novel neural operator architecture that embeds the complex geometries and inhomogeneous boundary values into the solving of elliptic PDEs. Inspired by classical Green's function, BENO consists of two Graph Neural Networks (GNNs) for interior source term and boundary values, respectively. Furthermore, a Transformer encoder maps the global boundary geometry into a latent vector which influences each message passing layer of the GNNs. We test our model and strong baselines extensively in elliptic PDEs with complex boundary conditions. We show that all existing baseline methods fail to learn the solution operator. In contrast, our model, endowed with boundary-embedded architecture, outperforms state-of-the-art neural operators and strong baselines by an average of 60.96%.
Haixin Wang · Jiaxin LI · Anubhav Dwivedi · Kentaro Hara · Tailin Wu

Machine Learning for Practical Quantum Error Mitigation (Poster)
Quantum computers are actively competing to surpass classical supercomputers, but quantum errors remain their chief obstacle. The key to overcoming these on near-term devices has emerged through the field of quantum error mitigation, enabling improved accuracy at the cost of additional runtime. In practice, however, the success of mitigation is limited by a generally exponential overhead. Can classical machine learning address this challenge on today's quantum computers? Here, through both simulations and experiments on state-of-the-art quantum computers using up to 100 qubits, we demonstrate that machine learning for quantum error mitigation (ML-QEM) can drastically reduce overheads, maintain or even surpass the accuracy of conventional methods, and yield near noise-free results for quantum algorithms. We benchmark a variety of machine learning models---linear regression, random forests, multi-layer perceptrons, and graph neural networks---on diverse classes of quantum circuits, over increasingly complex device-noise profiles, under interpolation and extrapolation, and for small and large quantum circuits. These tests employ the popular digital zero-noise extrapolation method as an added reference. We further show how to scale ML-QEM to classically intractable quantum circuits by mimicking the results of traditional mitigation methods, while significantly reducing overhead. Our results highlight the potential of classical machine learning for practical quantum computation.
Haoran Liao · Derek Wang · Iskandar Sitdikov · Ciro Salcedo · Alireza Seif · Zlatko Minev
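A minimal sketch of the regression idea behind ML-QEM, under the simplifying assumption of global depolarizing noise (synthetic data only; this is not the paper's models, circuits, or benchmarks): fit a linear map from noisy expectation values to ideal ones, which should recover the inverse noise scale $1/(1-p)$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.2                                               # depolarizing strength (assumed)
ideal = rng.uniform(-1, 1, size=200)                  # ideal <Z> expectation values
noisy = (1 - p) * ideal + rng.normal(0, 0.01, 200)    # depolarized + shot noise

# Least-squares fit of ideal ≈ a * noisy + b, the simplest "ML" mitigator.
A = np.stack([noisy, np.ones_like(noisy)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, ideal, rcond=None)

mitigated = a * noisy + b
assert abs(a - 1 / (1 - p)) < 0.1   # recovers the inverse noise scale
```

In the paper this idea is scaled up with richer features (circuit structure, noise profiles) and richer models (random forests, MLPs, GNNs); the point of the sketch is only the noisy-to-ideal regression framing.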

Distilling human decision-making dynamics: a comparative analysis of low-dimensional architectures (Poster)
Recent advances in examining biological decision-making behaviors have increasingly favored recurrent neural networks (RNNs) over traditional cognitive models grounded in normative principles such as reinforcement learning. This shift owes to RNNs' superior predictive performance on behavioral data, achieved with minimal manual engineering. To glean insights into biological decision-making through these networks, this approach focuses on identifying a compact set of latent dynamical variables by restricting the size of the recurrent layer's bottleneck. Yet, little is known about the distinctions between these low-dimensional RNN architectures and their practical effectiveness in capturing behavioral patterns of biological decision-making. Our study bridges this knowledge gap by 1) offering a comprehensive comparison of these low-dimensional RNN architectures with standardized terminology; 2) evaluating their predictive accuracy for human decision-making in an explore-exploit task; and 3) delivering RNN-derived insights that traditional cognitive models overlook. Remarkably, our findings highlight the superiority of low-rank RNNs over alternatives like gated recurrent units or disentangled RNNs in this task setting. More crucially, these low-rank RNNs reveal diverse strategies that individuals employ across different decision-making phases, advancing our understanding of intricate human decision-making dynamics. Our approach offers a powerful framework for discerning individual cognitive nuances.
Huadong Xiong · Li Ji-An · Marcelo G Mattar · Robert Wilson

Genomic language model predicts protein co-regulation and function (Poster)
Deciphering the relationship between a gene and its genomic context is fundamental to understanding and engineering biological systems. Machine learning has shown promise in learning latent relationships underlying the sequence-structure-function paradigm from massive protein sequence datasets; however, to date, limited attempts have been made to extend this continuum to include higher order genomic context information. Here, we trained a genomic language model (gLM) on millions of metagenomic scaffolds to learn the latent functional and regulatory relationships between genes. gLM learns contextualized protein embeddings that capture the genomic context as well as the protein sequence itself, and appears to encode biologically meaningful and functionally relevant information (e.g. enzymatic function). Our analysis of the attention patterns demonstrates that gLM is learning co-regulated functional modules (i.e. operons). Our findings illustrate that gLM's unsupervised deep learning of the metagenomic corpus is an effective and promising approach to encode functional semantics and regulatory syntax of genes in their genomic contexts and uncover complex relationships between genes in a genomic region.
Yunha Hwang · Andre Cornman · Elizabeth Kellogg · Sergey Ovchinnikov · Peter Girguis

Learning to Scale Logits for Temperature-Conditional GFlowNets (Poster)
GFlowNets are probabilistic models that learn a stochastic policy that sequentially generates compositional structures, such as molecular graphs. They are trained with the objective of sampling such objects with probability proportional to the object's reward. Among GFlowNets, the temperature-conditional GFlowNets represent a family of policies indexed by temperature, and each is associated with the correspondingly tempered reward function. The major benefit of temperature-conditional GFlowNets is the controllability of GFlowNets' exploration and exploitation through adjusting temperature. We propose Learning to Scale Logits for temperature-conditional GFlowNets (LSL-GFN), a novel architectural design that greatly accelerates the training of temperature-conditional GFlowNets. It is based on the idea that previously proposed temperature-conditioning approaches introduced numerical challenges in the training of the deep network because different temperatures may give rise to very different gradient profiles and ideal scales of the policy's logits. We find that the challenge is greatly reduced if a learned function of the temperature is used to scale the policy's logits directly. We empirically show that our strategy dramatically improves the performances of GFlowNets, outperforming other baselines, including reinforcement learning and sampling methods, in terms of discovering diverse modes in multiple biochemical tasks.
Minsu Kim · Joohwan Ko · Dinghuai Zhang · Ling Pan · Yun TaeYoung · Woochang Kim · Jinkyoo Park · Yoshua Bengio
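The scale issue the abstract alludes to can be seen in a toy example: the same policy logits yield drastically different action distributions under different temperature scalings, which is why the logit scale matters so much during training. This is generic softmax tempering, not the LSL-GFN architecture (whose scaling function is learned):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array of logits."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])   # illustrative policy logits

cold = softmax(logits / 0.1)    # low temperature: near-greedy policy
hot = softmax(logits / 20.0)    # high temperature: near-uniform policy

assert cold.argmax() == 0 and cold.max() > 0.99
assert hot.max() - hot.min() < 0.1
```

Because gradients flow through these very differently scaled distributions, conditioning the network on temperature alone can be numerically awkward; scaling the logits with a learned function of temperature, as the paper proposes, addresses this directly.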

Auto-PINN: Understanding and Optimizing Physics-Informed Neural Architecture (Poster)
Physics-Informed Neural Networks (PINNs) are revolutionizing science and engineering practices by harnessing the power of deep learning for scientific computation. The neural architecture's hyperparameters significantly impact the efficiency and accuracy of the PINN solver. However, optimizing these hyperparameters remains an open and challenging problem because of the large search space and the difficulty in identifying a suitable search objective for PDEs. In this paper, we propose Auto-PINN, the first systematic, automated hyperparameter optimization approach for PINNs, which employs Neural Architecture Search (NAS) techniques for PINN design. Auto-PINN avoids manually or exhaustively searching the hyperparameter space associated with PINNs. A comprehensive set of pre-experiments, using standard PDE benchmarks, enables us to probe the structure-performance relationship in PINNs. We discover that the different hyperparameters can be decoupled and that the training loss function of PINNs serves as an effective search objective. Comparison experiments with baseline methods demonstrate that Auto-PINN produces neural architectures with superior stability and accuracy over alternative baselines.
Yicheng Wang · Xiaotian Han · Chia-Yuan Chang · Daochen Zha · Ulisses M. Braga-Neto · Xia Hu

Augmenting large language models with chemistry tools (Poster)
Over the last decades, excellent computational chemistry tools have been developed. Integrating them into a single platform with enhanced accessibility could help them reach their full potential by overcoming steep learning curves. Recently, large-language models (LLMs) have shown strong performance in tasks across domains, but struggle with chemistry-related problems. Moreover, these models lack access to external knowledge sources, limiting their usefulness in scientific applications. In this study, we introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design. By integrating 18 expert-designed tools, ChemCrow augments the LLM performance in chemistry, and new capabilities emerge. Our agent autonomously planned and executed the syntheses of an insect repellent, three organocatalysts, and guided the discovery of a novel chromophore. Our evaluation, including both LLM and expert assessments, demonstrates ChemCrow's effectiveness in automating a diverse set of chemical tasks. Surprisingly, we find that GPT-4 as an evaluator cannot distinguish between clearly wrong GPT-4 completions and ChemCrow's performance. Our work not only aids expert chemists and lowers barriers for non-experts, but also fosters scientific advancement by bridging the gap between experimental and computational chemistry.
Andres M Bran · Sam Cox · Oliver Schilter · Carlo Baldassari · Andrew White · Philippe Schwaller

Scalable Particle Generation for Granular Shape Study (Poster)
The shape of granular matter (particles) is crucial for understanding their properties and assembly behavior. Existing studies often rely on intuitive or machine-derived shape descriptors (e.g. sphericity and Corey shape factors) and are usually carried out on single, individual particles with specific shape features, lacking statistical evaluation on a large number of particles. Meanwhile, it is also questionable whether the pre-selected shape descriptors would sufficiently capture the rich morphological information provided by the particle. In this paper, we first propose a two-step particle generation pipeline to evaluate the quality of the previous shape descriptors. To overcome the scarcity issue of particle samples, we explicitly use a Metaball-Imaging algorithm to transform pixel data into a lower-dimensional space and propose a conditional generative method to design 3D realistic style particles. Meanwhile, we also design a new shape estimator to provide shape constraints to guide the conditional generation process. Building on this, we then propose "attribute twins" --- particles that share identical shape features but differ in actual morphologies. Attribute twins provide essential particle samples to investigate whether existing shape descriptors are sufficient to represent the effects of particle shape. In a series of simulations focusing on the drag force experienced by settling particles in a fluid, we use these distilled attribute twins under different constraints of single or multiple shape descriptors. Our results shed light on the limitations of current shape descriptors in representing the influence of particle shape in this physical process and highlight the need for improved shape descriptors in the future.
Yifeng Zhao · Jinxin Liu · Xiangbo Gao · Pei Zhang · Sergio Andres Galindo Torres · Stan Z. Li

AdsGT: Graph Transformer for Predicting Global Minimum Adsorption Energy (Poster)
The fast assessment of the binding strength between adsorbates and catalyst surfaces is crucial for catalyst design, where global minimum adsorption energy (GMAE) is one of the most representative descriptors. However, catalyst surfaces typically have multiple adsorption sites and numerous possible adsorption configurations, which makes it prohibitively expensive to calculate the GMAE using Density Functional Theory (DFT). Additionally, most machine learning methods can only predict local minimum adsorption energies and rely on information about adsorption configurations. To overcome these challenges, we designed a graph transformer (AdsGT) that can predict the GMAE based on surface graphs and adsorbate feature vectors without any binding structure information. To evaluate the performance of AdsGT, three new datasets on GMAE were constructed from OC20-Dense, Catalysis Hub, and FG-dataset. For a wide range of combinations of catalyst surfaces and adsorbates, AdsGT achieves test mean absolute errors of 0.10 and 0.14 eV on the two GMAE datasets respectively, demonstrating its good reliability and generalizability.
Junwu Chen · Xu Huang · Cheng Hua · Yulian He · Philippe Schwaller

Generation of 3D Realistic Soil Particles with Metaball Descriptor (Poster)
The accurate representation of soil particle morphology is crucial for understanding its granular characteristics and assembly responses. However, incorporating realistic and diverse particle morphologies into modeling presents challenges, often requiring time-consuming and expensive X-ray Computed Tomography (XRCT). This has resulted in a prevalent issue in modeling: morphological particle generation. On this topic, we introduce the Metaball Variational Autoencoder. This method leverages deep neural networks to generate new 3D particles in the form of Metaballs while preserving essential morphological features from the parental particles. Furthermore, the method allows for shape control through an arithmetic pattern, enabling the generation of particles with specific shapes. We validate the generation fidelity by comparing the morphologies and shape-feature distributions of the generated particles with the parental data. Additionally, we provide examples to demonstrate the controllability of the generated shapes. By integrating these methods into the Metaball-based simulation framework proposed by the authors previously, we enable the incorporation of real particle shapes into simulations. This could facilitate the simulation of a large number of soil particles with varying shapes and behaviors, providing valuable insights into the properties and behavior of actual soil particles.
Yifeng Zhao · Jinxin Liu · Xiangbo Gao · Pei Zhang · Stan Z. Li · Sergio Andres Galindo Torres
-
|
AI for Open Science: A Multi-Agent Perspective for Ethically Translating Data to Knowledge
(
Poster
)
>
link
AI for Science (AI4Science), particularly in the form of self-driving labs, has the potential to sideline human involvement and hinder scientific discovery within the broader community. While prior research has focused on ensuring the responsible deployment of AI applications, enhancing security, and ensuring interpretability, we also propose that promoting openness in AI4Science discoveries should be carefully considered. In this paper, we introduce the concept of AI for Open Science (AI4OS) as a multi-agent extension of AI4Science with the core principle of maximizing open knowledge translation throughout the scientific enterprise rather than a single organizational unit. We use the established principles of Knowledge Discovery and Data Mining (KDD) to formalize a language around AI4OS. We then discuss three principal stages of knowledge translation embedded in AI4Science systems and detail specific points where openness can be applied to yield an AI4OS alternative. Lastly, we formulate a theoretical metric to assess AI4OS with a supporting ethical argument highlighting its importance. Our goal is that by drawing attention to AI4OS we can ensure the natural consequence of AI4Science (e.g., self-driving labs) is a benefit not only for its developers but for society as a whole. |
Chase Yakaboski · Gregory Hyde · Clement Nyanhongo · Eugene Santos 🔗 |
-
|
Testing Assumptions Underlying a Unified Theory for the Origin of Grid Cells
(
Poster
)
>
link
Representing and reasoning about physical space is fundamental to animal survival, and the mammalian lineage expresses a wealth of specialized neural representations that encode space. Grid cells, whose discovery earned a Nobel prize, are a striking example: a grid cell is a neuron that fires if and only if the animal is spatially located at the vertices of a regular triangular lattice that tiles all explored two-dimensional environments. Significant theoretical work has gone into understanding why mammals have learned these particular representations, and recent work has proposed a "unified theory for the computational and mechanistic origin of grid cells," claiming to answer why the mammalian lineage has learned grid cells. However, the Unified Theory makes a series of highly specific assumptions about the target readouts of grid cells, putatively place cells. In this work, we explicitly identify what these mathematical assumptions are, then test two of the critical assumptions using biological place cell data. At both the population and single-cell levels, we find evidence suggesting that neither of the assumptions is likely true in biological neural representations. These results call the Unified Theory into question, suggesting that biological grid cells likely have a different origin than those obtained in trained artificial neural networks. |
Rylan Schaeffer · Mikail Khona · Adrian Bertagnoli · Sanmi Koyejo · Ila Fiete 🔗 |
-
|
arXiVeri: Automatic table verification with GPT
(
Poster
)
>
link
Without accurate transcription of numerical data in scientific documents, a scientist cannot draw accurate conclusions. Unfortunately, the process of copying numerical data from one paper to another is prone to human error. In this paper, we propose to meet this challenge through the novel task of automatic table verification (AutoTV), in which the objective is to verify the accuracy of numerical data in tables by cross-referencing cited sources. To support this task, we propose a new benchmark, arXiVeri, which comprises tabular data drawn from open-access academic papers on arXiv. We introduce metrics to evaluate the performance of a table verifier in two key areas: (i) table matching, which aims to identify the source table in a cited document that corresponds to a target table, and (ii) cell matching, which aims to locate shared cells between a target and source table and identify their row and column indices accurately. By leveraging the flexible capabilities of modern large language models (LLMs), we propose simple baselines for table verification. Our findings highlight the complexity of this task, even for state-of-the-art LLMs like OpenAI’s GPT-4. |
Gyungin Shin · Weidi Xie · Samuel Albanie 🔗 |
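A toy version of the cell-matching subtask described above can make the metric concrete: given a target table and a source table, find value-matching cells and report their row and column indices. This is an illustrative sketch of our own (exact equality on hypothetical values), not the benchmark's actual metric, which would need tolerance and context handling.

```python
def match_cells(target, source):
    """Toy cell matching: for each numeric cell of a target table, record
    source-table cells with the same value and their (row, col) indices.
    (Illustrative only; the real metric must handle tolerance and context.)"""
    matches = []
    for i, row in enumerate(target):
        for j, val in enumerate(row):
            for si, srow in enumerate(source):
                for sj, sval in enumerate(srow):
                    if val == sval:
                        matches.append(((i, j), (si, sj)))
    return matches

target = [[71.2, 68.0], [55.4, 60.1]]   # hypothetical reported results
source = [[68.0, 71.2], [60.1, 99.9]]   # hypothetical cited-source table
matches = match_cells(target, source)
# 71.2, 68.0 and 60.1 are shared; 55.4 has no counterpart in the source.
```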
-
|
Exploring the Properties and Structure of Real Knowledge Graphs across Scientific Disciplines
(
Poster
)
>
link
Despite the recent popularity of knowledge graph (KG) related tasks and benchmarks such as KG embeddings, link prediction, entity alignment and their use in many domains, the structure and properties of real KGs are not well studied. In this paper, we perform a large-scale comparative study of 29 real KG datasets from diverse domains such as the natural sciences, medicine, and NLP to analyze their properties and structural patterns. Based on our findings we make recommendations regarding KG-based model development and evaluation. We believe that the rich structural information contained in KGs can benefit the development of better KG models across fields and we hope this study will contribute to breaking down the existing data silos between different scientific disciplines (e.g., biomedicine, ML/NLP, 'AI for Sciences'). |
Nedelina Teneva · Estevam Hruschka 🔗 |
-
|
Lineax: unified linear solves and linear least-squares in JAX and Equinox
(
Poster
)
>
link
We introduce Lineax, a library bringing linear solves and linear least-squares to the JAX+Equinox scientific computing ecosystem. Lineax uses general linear operators, and unifies linear solves and least-squares into a single, autodifferentiable API. Solvers and operators are user-extensible, without requiring the user to implement any custom derivative rules to get differentiability. Lineax is available at https://github.com/anonymised/lineax.
|
Jason Rader · Terry Lyons 🔗 |
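The unified-API idea in the abstract, one entry point that dispatches between an exact solve and a least-squares fit depending on the operator, can be sketched in plain NumPy. This is an illustrative sketch of the concept only, not Lineax's actual interface.

```python
import numpy as np

def solve(operator, b):
    """One entry point for linear solves: exact solve for square systems,
    least-squares otherwise. A plain-NumPy sketch of the unified-API idea,
    not Lineax's actual interface."""
    A = np.asarray(operator, dtype=float)
    if A.shape[0] == A.shape[1]:
        return np.linalg.solve(A, b)
    return np.linalg.lstsq(A, b, rcond=None)[0]

x = solve([[2.0, 0.0], [0.0, 4.0]], [2.0, 8.0])      # square: exact solution
y = solve([[1.0], [1.0], [1.0]], [1.0, 2.0, 3.0])    # tall: least-squares fit
```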
-
|
SE(3) Equivariant Augmented Coupling Flows
(
Poster
)
>
link
Coupling normalizing flows allow for fast sampling and density evaluation, making them the tool of choice for probabilistic modeling of physical systems. However, the standard coupling architecture precludes endowing flows that operate on the Cartesian coordinates of atoms with the SE(3) and permutation invariances of physical systems. This work proposes a coupling flow that preserves SE(3) and permutation equivariance by performing coordinate splits along additional augmented dimensions. At each layer, the flow maps atoms' positions into learned SE(3) invariant bases, where we apply standard flow transformations, such as monotonic rational-quadratic splines, before returning to the original basis. Crucially, our flow preserves fast sampling and density evaluation, and may be used to produce unbiased estimates of expectations with respect to the target distribution via importance sampling. When trained on the DW4, LJ13 and QM9-positional datasets, our flow is competitive with equivariant continuous normalizing flows, while allowing sampling two orders of magnitude faster. Moreover, we learn the full Boltzmann distribution of alanine dipeptide by only modeling the Cartesian positions of its atoms, and demonstrate that our flow can be trained to approximately sample from the Boltzmann distribution of the DW4 and LJ13 using only their energy functions. |
Laurence Midgley · Vincent Stimper · Javier Antorán · Emile Mathieu · Bernhard Schölkopf · José Miguel Hernández-Lobato 🔗 |
-
|
Randomized Benchmarking of Local Zeroth-Order Optimizers for Variational Quantum Systems
(
Poster
)
>
link
In the field of quantum information, classical optimizers play an important role. From experimentalists optimizing their physical devices to theorists exploring variational quantum algorithms, many aspects of quantum information require the use of a classical optimizer. For this reason, there are many papers that benchmark the effectiveness of different optimizers for specific quantum learning tasks and choices of parameterized algorithms. However, for researchers exploring new algorithms or physical devices, the insights from these studies don't necessarily translate. To address this concern, we compare the performance of classical optimizers across a series of partially-randomized tasks to more broadly sample the space of quantum learning problems. We focus on local zeroth-order optimizers due to their generally favorable performance and query-efficiency on quantum systems. We discuss insights from these experiments that can help motivate future works to improve these optimizers for use on quantum systems. |
Lucas Tecot · Cho-Jui Hsieh 🔗 |
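As a concrete example of the local zeroth-order optimizers discussed above, SPSA (simultaneous perturbation stochastic approximation) forms a gradient estimate from only two function evaluations per step. The sketch below is our own minimal illustration on a toy quadratic standing in for a quantum cost function; the gain schedules follow common SPSA defaults and are not taken from the paper.

```python
import numpy as np

def spsa_minimize(f, theta0, a=0.1, c=0.1, n_iter=2000, seed=0):
    """Minimal SPSA sketch: a local zeroth-order optimizer that forms a
    gradient estimate from only two function evaluations per iteration."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, n_iter + 1):
        ak = a / k ** 0.602                 # standard SPSA gain decay
        ck = c / k ** 0.101                 # perturbation-size decay
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher directions
        ghat = (f(theta + ck * delta) - f(theta - ck * delta)) / (2.0 * ck * delta)
        theta = theta - ak * ghat
    return theta

# Toy quadratic standing in for an expensive quantum cost function.
target = np.array([1.0, -2.0])
f = lambda t: float(np.sum((t - target) ** 2))
theta = spsa_minimize(f, [0.0, 0.0])
```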
-
|
Using the Transformer Model for Physical Simulation: An application on Transient Thermal Analysis for 3D Printing Process Simulation
(
Poster
)
>
link
Transient thermal analysis is widely used in many science and engineering areas such as electronic packaging, engine design and manufacturing. High dimensional simulations are very expensive to run. Here we propose a machine learning model consisting of a pre-trained convolutional neural network (CNN), a transformer encoder and a multilayer perceptron (MLP) to predict the temperature field of 3D printed parts. The CAD part used in 3D printing is first sliced into layers and represented as images. We use the pre-trained ResNet-34 to extract low level geometry features, taking the output feature map of its Conv_4 layer as the geometry embedding vector. The transformer encoder is used to capture the long-range dependencies between layer-wise geometry features. The MLP then takes the transformer's output and predicts the temperatures at given locations and time steps. Our results show the model can accurately predict the thermal history of the 3D printing process on different geometries. Our model is also very efficient, running 1~2 orders of magnitude faster than the simulation on which it is trained, without requiring the complicated pre-processing steps in transient thermal analysis, including CAD file fixing, material property setup, mesh generation and refinement, and defining the boundary conditions and dynamic loading at every time step. |
Qian Chen · Luyang Kong · Florian Dugast · Albert To 🔗 |
-
|
PGraphDTA: Improving Drug Target Interaction Prediction using Protein Language Models and Contact Maps
(
Poster
)
>
link
Developing and discovering new drugs is a complex and resource-intensive endeavor that often involves substantial costs, time investment, and safety concerns. A key aspect of drug discovery involves identifying novel drug-target (DT) interactions. Existing computational methods for predicting DT interactions have primarily focused on binary classification tasks, aiming to determine whether a DT pair interacts or not. However, protein-ligand interactions exhibit a continuum of binding strengths, known as binding affinity, presenting a persistent challenge for accurate prediction. In this study, we investigate various techniques employed in Drug Target Interaction (DTI) prediction and propose novel enhancements to improve their performance. Our approaches include the integration of Protein Language Models (PLMs) and the incorporation of Contact Map information as an inductive bias within current models. Through extensive experimentation, we demonstrate that our proposed approaches outperform the baseline models considered in this study, presenting a compelling case for further development in this direction. We anticipate that the insights gained from this work will significantly narrow the search space for potential drugs targeting specific proteins, thereby accelerating drug discovery. Code and data for PGraphDTA are available at https://anonymous.4open.science/r/PGraphDTA. |
Rakesh Bal · Yijia Xiao · Wei Wang 🔗 |
-
|
Discovery of Novel Reticular Materials for Carbon Dioxide Capture using GFlowNets
(
Poster
)
>
link
Artificial intelligence holds promise to improve materials discovery. GFlowNets are an emerging deep learning algorithm with many applications in AI-assisted discovery. By using GFlowNets, we generate porous reticular materials, such as metal organic frameworks and covalent organic frameworks, for applications in carbon dioxide capture. We introduce a new Python package (matgfn) to train and sample GFlowNets. We use matgfn to generate the matgfn-rm dataset of novel and diverse reticular materials with gravimetric surface area above 5000 $m^2/g$. We calculate single- and two-component gas adsorption isotherms for the top-100 candidates in matgfn-rm. These candidates are novel compared to the state-of-the-art ARC-MOF dataset and rank in the 90th percentile in terms of working capacity compared to the CoRE2019 dataset. We discover 15 materials outperforming all materials in CoRE2019.
|
Flaviu Cipcigan · Jonathan Booth · Rodrigo Neumann Barros Ferreira · Carine Dos Santos · Mathias Steiner 🔗 |
-
|
Extracting Nonlinear Symmetries From Trained Neural Networks on Dynamics Data
(
Poster
)
>
link
To support scientists who are developing reduced models of complex physics systems, we propose a method for extracting interpretable physics information from a deep neural network (DNN) trained on time series data of a physics system. Specifically, we propose a method for estimating the hidden nonlinear symmetries of a system from a DNN trained on time series data that can be regarded as a finite-degree-of-freedom classical Hamiltonian dynamical system. Our proposed method can estimate the nonlinear symmetries corresponding to the Laplace-Runge-Lenz vector, a conserved quantity that keeps the long-axis direction of the elliptical motion of a planet constant, and visualize its Lie manifold. |
Yoh-ichi Mototake 🔗 |
-
|
Modelling single-cell RNA-seq trajectories on a flat statistical manifold
(
Oral
)
>
link
In this study, we introduce a novel approach for enhancing the use of Optimal Transport (OT) in analysing gene expression trajectories within single-cell RNA-seq data. In contrast to existing methods that often depend on linear embeddings or Gaussian autoencoder latent spaces, our approach, performing OT-based trajectory inference on statistical manifolds, accounts for critical data characteristics such as sparsity, overdispersion, and geometry. We achieve this by implementing a "flattening" regularisation derived from the pullback metric of a negative binomial statistical manifold, ensuring alignment between the latent space of a discrete Variational Autoencoder (VAE) and Euclidean space, thereby improving compatibility with linear OT. Our real data results demonstrate that these constraints benefit the reconstruction of latent trajectories and the learning of velocity fields. We believe that this versatile approach holds promise for advancing single-cell representation learning and temporal modelling in the future. |
Alessandro Palma · Sergei Rybakov · Leon Hetzel · Fabian Theis 🔗 |
-
|
GeoMFormer: A General Architecture for Geometric Molecular Representation Learning
(
Poster
)
>
link
Molecular modeling, a central topic in quantum mechanics, aims to accurately calculate the properties and simulate the behaviors of molecular systems. The molecular model is governed by physical laws, which impose geometric constraints such as invariance and equivariance to coordinate rotation and translation. While numerous deep learning approaches have been developed to learn molecular representations under these constraints, most of them are built upon heuristic and costly modules. We argue that there is a strong need for a general and flexible framework for learning both invariant and equivariant features. In this work, we introduce a novel Transformer-based molecular model called GeoMFormer to achieve this goal. Using the standard Transformer modules, two separate streams are developed to maintain and learn invariant and equivariant representations. Carefully designed cross-attention modules bridge the two streams, allowing information fusion and enhancing geometric modeling in each stream. As a general and flexible architecture, we show that many previous architectures can be viewed as special instantiations of GeoMFormer. Extensive experiments are conducted to demonstrate the power of GeoMFormer. All empirical results show that GeoMFormer achieves strong performance on both invariant and equivariant tasks of different types and scales. Code and models will be made publicly available. |
Tianlang Chen · Shengjie Luo · Di He · Shuxin Zheng · Tie-Yan Liu · Liwei Wang 🔗 |
-
|
Scalable Deep Potentials as Implicit Hierarchical Semi-Separable Operators
(
Poster
)
>
link
Direct application of Transformer architectures in scientific domains poses computational challenges, due to quadratic scaling in the number of inputs. In this work, we propose an alternative method based on hierarchical semi-separable matrices (HSS), a class of rank-structured operators with linear-time evaluation algorithms. Through connections between linearized attention and HSS, we devise an implicit hierarchical parametrization strategy that interpolates between linear and quadratic attention, achieving both subquadratic scaling and high accuracy. We demonstrate the effectiveness of the proposed approach on the approximation of potentials from computational physics. |
Michael Poli · Stefano Massaroli · Christopher Ré · Stefano Ermon 🔗 |
-
|
Predictive Uncertainty Quantification for Graph Neural Network Driven Relaxed Energy Calculations
(
Poster
)
>
link
Graph neural networks (GNNs) have been shown to be astonishingly capable models for molecular property prediction, particularly as surrogates for expensive density functional theory calculations of relaxed energy for novel material discovery. However, one limitation of GNNs in this context is the lack of useful uncertainty prediction methods, which are critical to the material discovery pipeline. In this work, we show that uncertainty quantification for relaxed energy calculations is more complex than uncertainty quantification for other kinds of molecular property prediction, due to the effect that structure optimizations have on the error distribution. We propose that distribution-free techniques are more useful tools for assessing calibration, recalibrating, and developing uncertainty prediction methods for GNNs performing relaxed energy calculations. We also develop a relaxed energy task for evaluating uncertainty methods for equivariant GNNs, based on distribution-free recalibration and using the Open Catalyst Project dataset. We benchmark a set of popular uncertainty prediction methods on this task, and show that latent distance methods, with our novel improvements, are the most well-calibrated and economical approach for relaxed energy calculations. Further, we challenge the community to develop improved uncertainty prediction methods for GNN-driven relaxed energy calculations, and benchmark them on this task. |
Joseph Musielewicz · Janice Lan · Matt Uyttendaele 🔗 |
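One widely used distribution-free technique of the kind the abstract refers to is split conformal prediction. The sketch below is an assumption on our part (not necessarily the authors' exact recalibration procedure): it builds prediction intervals from held-out residuals of a surrogate predictor and checks their empirical coverage.

```python
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Split-conformal sketch: distribution-free prediction intervals built
    from held-out residuals (an illustrative stand-in, not necessarily the
    authors' recalibration procedure)."""
    residuals = np.abs(cal_true - cal_pred)
    n = len(residuals)
    # Finite-sample-corrected empirical quantile of the residuals.
    q = np.quantile(residuals, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    return test_pred - q, test_pred + q

rng = np.random.default_rng(0)
true = rng.normal(size=2000)
pred = true + rng.normal(scale=0.3, size=2000)  # surrogate predictions with noise
lo, hi = conformal_interval(pred[:1000], true[:1000], pred[1000:])
coverage = float(np.mean((true[1000:] >= lo) & (true[1000:] <= hi)))
```

For a nominal level of alpha = 0.1, the empirical coverage should land near 90%.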
-
|
GFN-SR: Symbolic Regression with Generative Flow Networks
(
Poster
)
>
link
Symbolic regression (SR) is an area of interpretable machine learning that aims to identify mathematical expressions, often composed of simple functions, that best fit a given set of covariates $X$ and response $y$. In recent years, deep symbolic regression (DSR) has emerged as a popular method in the field by leveraging deep reinforcement learning to solve the complicated combinatorial search problem. In this work, we propose an alternative framework (GFN-SR) to approach SR with deep learning. We model the construction of an expression tree as traversing through a directed acyclic graph (DAG) so that GFlowNet can learn a stochastic policy to generate such trees sequentially. Enhanced with an adaptive reward baseline, our method is capable of generating a diverse set of best-fitting expressions. Notably, we observe that GFN-SR outperforms other SR algorithms in noisy data regimes, owing to its ability to learn a distribution of rewards over a space of candidate solutions.
|
Sida Li · Ioana Marinescu · Sebastian Musslick 🔗 |
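The sequential construction of expression trees described above can be illustrated with a toy sampler: a prefix token sequence defines a tree, and a policy (here uniform random, standing in for the learned GFlowNet policy) extends it one token at a time until no open argument slots remain. The token vocabulary below is our own illustrative choice.

```python
import math
import random

# Token vocabulary (name -> arity). A prefix sequence encodes an expression tree.
OPS = {"add": 2, "mul": 2, "sin": 1, "x": 0, "const": 0}

def sample_prefix(rng, max_len=15):
    """Sample a complete prefix expression token by token, mirroring how a
    sequential policy (here uniform random, not a trained GFlowNet) would
    extend a partial tree until no open argument slots remain."""
    seq, open_slots = [], 1
    while open_slots > 0:
        if len(seq) + open_slots >= max_len:    # force leaves near the budget
            tok = rng.choice(["x", "const"])
        else:
            tok = rng.choice(list(OPS))
        seq.append(tok)
        open_slots += OPS[tok] - 1
    return seq

def evaluate(seq, x):
    """Recursively evaluate a prefix token list (consumes the list)."""
    tok = seq.pop(0)
    if tok == "x":
        return x
    if tok == "const":
        return 1.0
    if tok == "sin":
        return math.sin(evaluate(seq, x))
    a, b = evaluate(seq, x), evaluate(seq, x)
    return a + b if tok == "add" else a * b

rng = random.Random(0)
expr = sample_prefix(rng)
value = evaluate(list(expr), x=2.0)
```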
-
|
Machine Learning Force Fields with Data Cost Aware Training
(
Poster
)
>
link
Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation, which finds widespread applications in chemistry and biomedical research. Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels generated by expensive quantum mechanical algorithms, which may scale as $O(n^3)$ to $O(n^7)$, with $n$ proportional to the number of basis functions. To address this issue, we propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data. The motivation behind ASTEROID is that inaccurate data, though incurring large bias, can help capture the sophisticated structures of the underlying force field. Therefore, we first train an MLFF model on a large amount of inaccurate training data, employing a bias-aware loss function to prevent the model from overfitting to the potential bias of this data. We then fine-tune the obtained model using a small amount of accurate training data, which preserves the knowledge learned from the inaccurate training data while significantly improving the model's accuracy. Moreover, we propose a variant of ASTEROID based on score matching for the setting where the inaccurate training data are unlabeled. Extensive experiments on MD datasets and downstream tasks validate the efficacy of ASTEROID.
|
Alexander Bukharin · Tianyi Liu · Shengjie Wang · Simiao Zuo · Weihao Gao · Wen Yan · Tuo Zhao 🔗 |
-
|
Learning Interatomic Potentials at Multiple Scales
(
Poster
)
>
link
The need to use a short time step is a key limit on the speed of molecular dynamics (MD) simulations. Simulations governed by classical potentials are often accelerated by using a multiple-time-step (MTS) integrator that evaluates certain potential energy terms that vary more slowly than others less frequently. This approach is enabled by the simple but limiting analytic forms of classical potentials. Machine learning interatomic potentials (MLIPs), in particular recent equivariant neural networks, are much more broadly applicable than classical potentials and can faithfully reproduce the expensive but accurate reference electronic structure calculations used to train them. They still, however, require the use of a single short time step, as they lack the inherent term-by-term scale separation of classical potentials. This work introduces a method to learn a scale separation in complex interatomic interactions by co-training two MLIPs. Initially, a small and efficient model is trained to reproduce short-time-scale interactions. Subsequently, a large and expressive model is trained jointly to capture the remaining interactions not captured by the small model. When running MD, the MTS integrator then evaluates the smaller model for every time step and the larger model less frequently, accelerating simulation. Compared to a conventionally trained MLIP, our approach can achieve a significant speedup (~3x in our experiments) without a loss of accuracy on the potential energy or simulation-derived quantities. |
Xiang Fu · Albert Musaelian · Anders Johansson · Tommi Jaakkola · Boris Kozinsky 🔗 |
-
|
DynamicsDiffusion: Generating and Rare Event Sampling of Molecular Dynamic Trajectories Using Diffusion Models
(
Poster
)
>
link
Molecular dynamics simulations are fundamental tools for quantitative molecular sciences. However, these simulations are computationally demanding and often struggle to sample rare events crucial for understanding spontaneous organization and reconfiguration in complex systems. To improve general speed and the ability to sample rare events in a directed fashion, we propose a method called $\textit{DynamicsDiffusion}$, based on denoising diffusion probabilistic models (DDPM), to generate molecular dynamics trajectories from noise. The generative model can then serve as a surrogate to sample rare events. We leverage properties of DDPMs such as conditional generation and the ability to generate variations of trajectories, including trajectories satisfying given conditions, such as crossing from one state to another, via the 'inpainting' property of DDPMs, which becomes applicable only when generating whole trajectories rather than individual conformations. To our knowledge, this is the first deep generative model for generating molecular dynamics trajectories. We hope this work will motivate a new generation of generative modeling for the study of molecular dynamics.
|
Magnus Petersen · Gemma Roig · Roberto Covino 🔗 |
-
|
FSscore: A Machine Learning-based Synthetic Feasibility Score Leveraging Human Expertise
(
Poster
)
>
link
Determining whether a molecule can be synthesized is crucial for many aspects of chemistry and drug discovery, allowing prioritization of experimental work and ranking molecules in de novo design tasks. Existing scoring approaches to assess synthetic feasibility struggle to extrapolate to out-of-distribution chemical spaces or fail to discriminate based on minor differences such as chirality that might be obvious to trained chemists. This work aims to address these limitations by introducing the Focused Synthesizability score (FSscore), which learns to rank structures based on binary preferences using a graph attention network. First, a baseline trained on an extensive set of reactant-product pairs is established, which is subsequently fine-tuned with expert human feedback on a chemical space of interest. Fine-tuning on focused datasets improves performance on these chemical scopes over the pre-trained model, which exhibits only moderate performance and generalizability. This enables distinguishing hard- from easy-to-synthesize molecules and improving the synthetic accessibility of generative model outputs. On very complex scopes with limited labels, achieving satisfactory gains remains challenging. The FSscore showcases how human expert feedback can be utilized to optimize the assessment of synthetic feasibility for a variety of applications. |
Rebecca Neeser · Bruno Correia · Philippe Schwaller 🔗 |
-
|
Bi-level Graphs for Cellular Pattern Discovery
(
Poster
)
>
link
The tumor microenvironment is widely recognized for its central role in driving cancer progression and influencing prognostic outcomes. Despite extensive research efforts dedicated to characterizing this complex and heterogeneous environment, considerable challenges persist. In this study, we introduce a novel data-driven approach for identifying tumor microenvironment patterns that, we show, are closely tied to patient prognoses. Our methodology relies on the construction of a bi-level graph model: (i) a cellular graph, which models the intricate tumor microenvironments, and (ii) a population graph that captures inter-patient similarities, given their respective cellular graphs, by means of a soft Weisfeiler-Lehman kernel. This systematic integration of information across different scales enables us to identify patient subgroups exhibiting unique prognoses while unveiling certain tumor microenvironment patterns that characterize them. We demonstrate our approach in a cohort of breast cancer patients, identify crucial tumor microenvironment patterns associated with patient prognosis, and validate these patterns in a completely independent cohort. Our study provides valuable insights into the prognostic implications of the breast tumor microenvironment, and this methodology holds the potential to analyze other cancers. |
Zhenzhen Wang · Aleksander Popel · Jeremias Sulam 🔗 |
-
|
Retro-fallback: retrosynthetic planning in an uncertain world
(
Poster
)
>
link
Retrosynthesis is the task of proposing a series of chemical reactions to create a desired molecule from simpler, buyable molecules. While previous works have proposed algorithms to find optimal solutions for a range of metrics (e.g. shortest, lowest-cost), these works generally overlook the fact that we have imperfect knowledge of the space of possible reactions, meaning plans created by the algorithm may not work in a laboratory. In this paper we propose a novel formulation of retrosynthesis in terms of stochastic processes to account for this uncertainty. We then propose a novel greedy algorithm called retro-fallback which maximizes the probability that at least one synthesis plan can be executed in the lab. Using in-silico benchmarks we demonstrate that retro-fallback generally produces better sets of synthesis plans than the popular MCTS and retro* algorithms. |
Austin Tripp · Krzysztof Maziarz · Sarah Lewis · Marwin Segler · José Miguel Hernández-Lobato 🔗 |
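Under a simplifying independence assumption (the paper itself models uncertainty with stochastic processes, so correlated failures are possible), the objective "probability that at least one synthesis plan can be executed" reduces to a simple product over per-plan success probabilities, which a greedy selector can maximize. The sketch below is our own toy illustration, not the retro-fallback algorithm.

```python
def success_prob(plans):
    """P(at least one plan succeeds), assuming independent plan failures
    (a simplification; the paper handles correlations via stochastic processes)."""
    p_all_fail = 1.0
    for p in plans:
        p_all_fail *= 1.0 - p
    return 1.0 - p_all_fail

def greedy_select(plan_probs, k):
    """Greedily add the plan that most increases the set's success probability."""
    chosen, remaining = [], list(plan_probs)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda p: success_prob(chosen + [p]))
        remaining.remove(best)
        chosen.append(best)
    return chosen

probs = [0.9, 0.5, 0.3]
p_success = success_prob(probs)     # 1 - 0.1 * 0.5 * 0.7
picked = greedy_select(probs, 2)
```

Under independence the greedy choice simply takes the most reliable plans first; the interesting behavior in the paper arises precisely when failures are correlated and diversity matters.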
-
|
A Framework for Toxic PFAS Replacement based on GFlowNet and Chemical Foundation Model
(
Poster
)
>
link
Per- and polyfluoroalkyl substances (PFAS) are a broad class of molecules used in almost every sector of industry and consumer goods. PFAS exhibit highly desirable properties, such as high durability, water repellance or high acidity, that are difficult to match. As a side effect, PFAS persist in the environment and have detrimental effects on human health. Epidemiological research has linked PFAS exposure to chronic health conditions, including dyslipidemia, cardiometabolic disorders, liver damage, and hypercholesterolemia. Recently, public health agencies significantly strengthened regulations on the use of PFAS. Therefore, alternatives are needed to maintain the pace of technological developments in multiple areas that traditionally relied on PFAS. To support the discovery of alternatives, we introduce MatGFN-PFAS, an AI system that generates PFAS replacements. We build MatGFN-PFAS using Generative Flow Networks (GFlowNets) for generation and a Chemical Language Model (MolFormer) for property prediction. We evaluate MatGFN-PFAS by exploring potential replacements of PFAS superacids, defined as molecules with negative pKa, that are critical for the semiconductor industry. It might be challenging to eliminate PFAS superacids entirely as a class due to the strong constraints on their functional performance. The proposed approach aims to account for this possibility and enables the generation of safer PFAS superacids as well. We evaluate two design strategies: 1) using Tversky similarity to design molecules similar to a target PFAS but with lower toxicity, and 2) directly generating molecules with negative pKa and low toxicity. For the query SMILES CC1CC(CC(F)(F)C(F)(F)OC(F)(F)C(F)(F)S(=O)(=O)O)OC1=O, MatGFN-PFAS was able to generate a candidate with very low toxicity (LD50 = 7304.23), strong acidity (pKa = -1.92), and a high similarity score (89.32%) to the query molecule. 
Results demonstrate that MatGFN-PFAS consistently generates replacement molecules satisfying all of the aforementioned constraints. The resulting datasets for each studied molecule are available at anonymized. |
Eduardo Soares · Flaviu Cipcigan · Dmitry Zubarev · Emilio Vital Brazil 🔗 |
-
|
Fast and Scalable Inference of Dynamical Systems via Integral Matching
(
Poster
)
>
link
We present a novel approach to identifying parameters of nonlinear Ordinary Differential Equations (ODEs). This method, which is based on collocation methods, enables the direct identification of parameters from time series data by matching the integral of the dynamics with an interpolation of the trajectory. This method is distinct from existing literature in that it does not require ODE solvers or an estimate of the time derivative. Furthermore, batching strategies, such as time subintervals and components of the state, are proposed to improve scalability, thus providing a fast and highly parallel method to evaluate gradients and faster convergence than adjoint methods. The effectiveness of the method is demonstrated on chaotic systems, with speed-ups of three orders of magnitude compared to adjoint methods, and its robustness to observational noise and data availability is assessed. |
Baptiste Rossi · Dimitris Bertsimas 🔗 |
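As a minimal illustration of the integral-matching idea (our own toy reconstruction for a one-parameter linear ODE, not the authors' implementation), the parameter can be recovered by least squares without an ODE solver or a derivative estimate:

```python
import numpy as np

# Toy sketch of integral matching: estimate theta in dx/dt = theta * x by
# matching x(t) - x(0) against theta * integral of x, avoiding both ODE
# solvers and numerical differentiation of the data.
theta_true = -0.7
t = np.linspace(0.0, 2.0, 201)
x = np.exp(theta_true * t)          # observed trajectory (noise-free here)

# Cumulative integral of the interpolated trajectory (trapezoidal rule).
X = np.concatenate([[0.0], np.cumsum(0.5 * (x[1:] + x[:-1]) * np.diff(t))])

# Least-squares match: x(t) - x(0) ≈ theta * X.
theta_hat = np.dot(X, x - x[0]) / np.dot(X, X)
```

With multiple parameters, the same matching becomes a standard linear or nonlinear least-squares problem, and batching over subintervals corresponds to restricting the integrals to pieces of the trajectory.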
-
|
MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design
(
Poster
)
>
link
Metal-organic frameworks (MOFs) are of immense interest in applications such as gas storage and carbon capture due to their exceptional porosity and tunable chemistry. Their modular nature has enabled the use of template-based methods to generate hypothetical MOFs by combining molecular building blocks in accordance with known network topologies. However, the ability of these methods to identify top-performing MOFs is often hindered by the limited diversity of the resulting chemical space. In this work, we propose MOFDiff: a coarse-grained (CG) diffusion model that generates CG MOF structures through a denoising diffusion process over the coordinates and identities of the building blocks. The all-atom MOF structure is then determined through a novel assembly algorithm. As the diffusion model generates 3D MOF structures by predicting scores in E(3), we employ equivariant graph neural networks that respect the permutational and roto-translational symmetries. We comprehensively evaluate our model's capability to generate valid and novel MOF structures and its effectiveness in designing outstanding MOF materials for carbon capture applications with molecular simulations. |
Xiang Fu · Tian Xie · Andrew Rosen · Tommi Jaakkola · Jake Smith 🔗 |
-
|
Rapid Prediction of Two-dimensional Airflow in an Operating Room using Scientific Machine Learning
(
Poster
)
>
link
We consider the problem of using scientific machine learning (SciML) to rapidly predict solutions to systems of nonlinear partial differential equations (PDEs) defined over complex geometries. In particular, we focus on modeling how airflow in operating rooms (ORs) is affected as the position of an object within the OR varies. We develop data-driven and physics-informed operator-learning models based on the deep operator network (DeepONet) architecture. The DeepONet models are able to accurately and rapidly predict airflow solutions to novel parameter configurations, and they surpass the accuracy of a random forest (RF) baseline. Interestingly, we find that physics-informed regularization (PIR) does not enhance model accuracy, partially because of misspecification of the physical prior compared to the data’s governing equations. Existing SciML models struggle in predicting flow when complex geometries determine localized behavior. |
Gary Collins · Alexander New · Ryan Darragh · Brian Damit · Christopher Stiles 🔗 |
-
|
Latent Task-Specific Graph Network Simulators
(
Poster
)
>
link
Simulating dynamic physical interactions is a critical challenge across multiple scientific domains, with applications ranging from robotics to material science. For mesh-based simulations, Graph Network Simulators (GNSs) pose an efficient alternative to traditional physics-based simulators. Their inherent differentiability and speed make them particularly well-suited for inverse design problems. Yet, adapting to new tasks from limited available data is an important aspect for real-world applications that current methods struggle with. We frame mesh-based simulation as a meta-learning problem and use a recent Bayesian meta-learning method to improve GNSs' adaptability to new scenarios by leveraging context data and handling uncertainties. Our approach, the latent task-specific graph network simulator, uses non-amortized task posterior approximations to sample latent descriptions of unknown system properties. Additionally, we leverage movement primitives for efficient full-trajectory prediction, effectively addressing the issue of accumulating errors encountered by previous auto-regressive methods. We validate the effectiveness of our approach through various experiments, performing on par with or better than established baseline methods. Movement primitives further allow us to accommodate various types of context data, as demonstrated through the utilization of point clouds during inference. By combining GNSs with meta-learning, we bring them closer to real-world applicability, particularly in scenarios with smaller datasets. |
Philipp Dahlinger · Niklas Freymuth · Tai Hoang · Michael Volpp · Gerhard Neumann 🔗 |
-
|
Citation-Similarity Relationships in Astrophysics Literature
(
Poster
)
>
link
We report a novel observation about which scientific publications are cited more frequently: those that are more textually similar to pre-existing publications. Using word-vector-based document embeddings, we analyze quantitative trends for a large sample of publication abstracts in the field of astrophysics (N ~ 300,000). When new publications are ranked by how many similar publications already exist in their neighborhood, the number of citations per year that the upper 50th percentile receives is ~ 2 times that of the lower 50th percentile. When new publications are ranked by an alternative metric of dissimilarity to neighbors, citations per year decrease by a factor of ~ 1.5 from the lower to the upper 50th percentile. We discuss a number of hypotheses that could explain this citation-similarity relationship, relevant to the science of science. |
Nathaniel Imel · Zachary Hafen 🔗 |
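The neighborhood-density ranking described above can be sketched as follows; the embeddings here are random stand-ins for real abstract embeddings, and the similarity threshold is an arbitrary choice for illustration:

```python
import numpy as np

# Sketch (with synthetic data) of ranking new papers by how many similar
# prior papers exist in their embedding neighborhood.
rng = np.random.default_rng(0)
prior = rng.normal(size=(500, 16))   # embeddings of pre-existing publications
new = rng.normal(size=(10, 16))      # embeddings of new publications

# Normalize so that dot products are cosine similarities.
prior /= np.linalg.norm(prior, axis=1, keepdims=True)
new /= np.linalg.norm(new, axis=1, keepdims=True)

sims = new @ prior.T                 # (10, 500) cosine similarities
n_similar = (sims > 0.3).sum(axis=1) # neighbors above an arbitrary threshold
ranking = np.argsort(-n_similar)     # densest neighborhood first
```

Splitting the ranked papers at the median of `n_similar` and comparing citation rates across the two halves mirrors the percentile comparison reported in the abstract.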
-
|
Electron-Derived Molecular Representation Learning for Real-World Molecular Physics
(
Poster
)
>
link
Various representation learning methods for molecular structures have been devised to accelerate data-driven drug and materials discovery. However, the representation capabilities of existing methods are essentially limited to atom-level information, which is not sufficient to describe real-world molecular physics. Although electron-level information can provide fundamental knowledge about chemical compounds beyond the atom-level information, obtaining the electron-level information in real-world molecules is computationally impractical and sometimes infeasible. We propose a new method for learning electron-derived molecular representations without additional computation costs by transferring pre-calculated electron-level information about small molecules to large molecules of our interest. The proposed method achieved state-of-the-art prediction accuracy on extensive benchmark datasets containing experimentally observed molecular physics. |
Gyoung S. Na · Chanyoung Park 🔗 |
-
|
Coupling Semi-supervised Learning with Reinforcement Learning for Better Decision Making --- An application to Cryo-EM Data Collection
(
Poster
)
>
link
We consider a semi-supervised Reinforcement Learning (RL) approach that takes inputs from a perception model. The performance of such an approach can be significantly limited by the quality of the perception model in the low-labeled-data regime. We propose a novel iterative framework that couples and simultaneously improves the training of both the RL agent and the perception model. The perception model is trained on pseudo labels generated from the trajectories of a trained RL agent, on the premise that the decision model can correct errors made by the perception model. We apply the framework to cryo-electron microscopy (cryo-EM) data collection, whose goal is to find as many high-quality micrographs as possible by navigating across different magnification levels. Our proposed method significantly outperforms various baseline methods in terms of both RL rewards and the accuracy of the perception model. We further provide theoretical insights into the benefits of coupling the decision model and the perception model by showing that RL-generated pseudo labels are biased towards localization, which aligns with the underlying data-generating mechanism. Our iterative framework, which couples both sides of semi-supervised RL, can be applied to a wide range of sequential decision-making tasks when labeled data is limited. |
Ziping Xu · Quanfu Fan · Yilai Li · Emma Lee · john cohn · Ambuj Tewari · Seychelle Vos · Michael Cianfrocco 🔗 |
-
|
Beam Enumeration: Probabilistic Explainability For Sample Efficient Self-conditioned Molecular Design
(
Poster
)
>
link
Generative molecular design has moved from proof-of-concept to real-world applicability, as marked by the surge in very recent papers reporting experimental validation. Key challenges in explainability and sample efficiency present opportunities to enhance generative design to directly optimize expensive high-fidelity oracles and provide actionable insights to domain experts. Here, we propose Beam Enumeration to exhaustively enumerate the most probable sub-sequences from language-based molecular generative models and show that molecular substructures can be extracted. When coupled with reinforcement learning, extracted substructures become meaningful, providing a source of explainability and improving sample efficiency through self-conditioned generation. Beam Enumeration is generally applicable to any language-based molecular generative model and notably further improves the performance of the recently reported Augmented Memory algorithm, which achieved the new state-of-the-art on the Practical Molecular Optimization benchmark for sample efficiency. The combined algorithm generates more high-reward molecules, and does so faster, given a fixed oracle budget. Beam Enumeration is the first method to jointly address explainability and sample efficiency for molecular design. |
Jeff Guo · Philippe Schwaller 🔗 |
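Beam-style enumeration of the most probable sub-sequences can be illustrated on a toy bigram model standing in for a trained molecular generator; the token set and probability table below are invented for illustration:

```python
import heapq

# Toy sketch of enumerating the most probable sub-sequences from an
# autoregressive model. The bigram table is invented; "^" marks the start.
BIGRAM = {
    "^": {"C": 0.7, "O": 0.2, "F": 0.1},
    "C": {"C": 0.5, "O": 0.3, "F": 0.2},
    "O": {"C": 0.6, "O": 0.1, "F": 0.3},
    "F": {"C": 0.8, "O": 0.1, "F": 0.1},
}

def beam_enumerate(length, beam_width):
    beams = [(1.0, "")]                   # (probability, sub-sequence)
    for _ in range(length):
        expanded = []
        for p, seq in beams:
            ctx = seq[-1] if seq else "^"
            for tok, q in BIGRAM[ctx].items():
                expanded.append((p * q, seq + tok))
        # Keep only the most probable partial sequences.
        beams = heapq.nlargest(beam_width, expanded)
    return beams

top = beam_enumerate(length=3, beam_width=4)
```

In the paper's setting the retained high-probability sub-sequences are then inspected for recurring molecular substructures; here they are just the most likely toy strings.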
-
|
DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies
(
Poster
)
>
link
In the next decade, deep learning may revolutionize the natural sciences, enhancing our capacity to model and predict natural occurrences. This could herald a new era of scientific exploration, bringing significant advancements across sectors from drug development to renewable energy. To answer this call, we present the DeepSpeed4Science initiative, which aims to build unique capabilities through AI system technology innovations to help domain experts unlock today’s biggest science mysteries. By leveraging DeepSpeed’s current technology pillars (training, inference, and compression) as base technology enablers, DeepSpeed4Science will create a new set of AI system technologies tailored for accelerating scientific discoveries by addressing their unique complexity beyond the common technical approaches used for accelerating generic large language models (LLMs). In this paper, we showcase the early progress we made with DeepSpeed4Science in addressing two of the critical system challenges in structural biology research. |
Shuaiwen Song · Bonnie Kruft · Minjia Zhang · Conglong Li · Shiyang Chen · Chengming Zhang · Masahiro Tanaka · Xiaoxia Wu · Mohammed AlQuraishi · Gustaf Ahdritz · Christina Floristean · Rick Stevens · Venkatram Vishwanath · Arvind Ramanathan · Sam Foreman · Kyle Hippe · Prasanna Balaprakash · Yuxiong He
|
-
|
Seeking Truth and Beauty in Flavor Physics with Machine Learning
(
Poster
)
>
link
The discovery process of building new theoretical physics models involves the dual aspect of both fitting to the existing experimental data and satisfying abstract theorists' criteria like beauty, naturalness, etc. We design loss functions for performing both of those tasks with machine learning techniques. We use the Yukawa quark sector as a toy example to demonstrate that the optimization of these loss functions results in true and beautiful models. |
Konstantin Matchev · Katia Matcheva · Pierre Ramond · Sarunas Verner 🔗 |
-
|
Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks
(
Poster
)
>
link
Molecular Representation Learning (MRL) has proven impactful in numerous biochemical applications such as drug discovery and enzyme design. While Graph Neural Networks (GNNs) are effective at learning molecular representations from a 2D molecular graph or a single 3D structure, existing works often overlook the flexible nature of molecules, which continuously interconvert across conformations via chemical bond rotations and minor vibrational perturbations. To better account for molecular flexibility, some recent works formulate MRL as an ensemble learning problem, focusing on explicitly learning from a set of conformer structures. However, most of these studies have limited datasets, tasks, and models. In this work, we introduce the first MoleculAR Conformer Ensemble Learning (MARCEL) benchmark to thoroughly evaluate the potential of learning on conformer ensembles and suggest promising research directions. MARCEL includes four datasets covering diverse molecule- and reaction-level properties of chemically diverse molecules including organocatalysts and transition-metal catalysts, extending beyond the scope of common GNN benchmarks that are confined to drug-like molecules. In addition, we conduct a comprehensive empirical study, which benchmarks representative 1D, 2D, and 3D molecular representation learning models, along with two strategies that explicitly incorporate conformer ensembles into 3D MRL models. Our findings reveal that direct learning from an accessible conformer space can improve performance on a variety of tasks and models. |
Yanqiao Zhu · Jeehyun Hwang · Keir Adams · Zhen Liu · Bozhao Nan · Brock Stenfors · Yuanqi Du · Jatin Chauhan · Olaf Wiest · Olexandr Isayev · Connor Coley · Yizhou Sun · Wei Wang
|
-
|
Towards stable real-world equation discovery with assessing differentiating quality influence
(
Poster
)
>
link
This paper explores the critical role of differentiation approaches in data-driven differential equation discovery. Accurate derivatives of the input data are essential for reliable algorithmic operation, particularly in real-world scenarios where measurement quality is inevitably compromised. We propose alternatives to the commonly used finite-differences-based method, notorious for its instability in the presence of noise, which can exacerbate random errors in the data. Our analysis covers four distinct methods: Savitzky-Golay filtering, spectral differentiation, smoothing based on artificial neural networks, and the regularization of derivative variation. We evaluate these methods in terms of their applicability to problems similar to real-world ones and their ability to ensure the convergence of equation discovery algorithms, providing valuable insights for robust modeling of real-world processes. |
Mikhail Masliaev · Ilya Markov · Alexander Hvatov 🔗 |
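The contrast between noise-amplifying finite differences and a Savitzky-Golay-style local polynomial derivative can be sketched on synthetic data (our own toy comparison, not the paper's experimental setup):

```python
import numpy as np

# Toy comparison: derivative of noisy sin(t). Finite differences amplify the
# noise by ~1/dt; a local quadratic fit (Savitzky-Golay style) suppresses it.
rng = np.random.default_rng(1)
t = np.linspace(0, 2 * np.pi, 400)
dt = t[1] - t[0]
y = np.sin(t) + 0.01 * rng.normal(size=t.size)   # noisy measurements
true_dy = np.cos(t)

# Central finite differences.
fd = np.gradient(y, dt)

# Local quadratic fit in a sliding window; the fit's slope at the window
# center (coefficient c[1]) is the smoothed derivative estimate.
half = 10
sg = np.empty_like(y)
for i in range(y.size):
    lo, hi = max(0, i - half), min(y.size, i + half + 1)
    c = np.polyfit(t[lo:hi] - t[i], y[lo:hi], 2)
    sg[i] = c[1]

err_fd = np.sqrt(np.mean((fd - true_dy) ** 2))
err_sg = np.sqrt(np.mean((sg - true_dy) ** 2))
```

For equation discovery, the practical point is that `err_sg` stays small where `err_fd` is dominated by amplified measurement noise, which is what destabilizes downstream term selection.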
-
|
Optimizing Markov Chain Monte Carlo Convergence with Normalizing Flows and Gibbs Sampling
(
Poster
)
>
link
Generative models have started to integrate into the scientific computing toolkit. One notable instance of this integration is the utilization of normalizing flows (NF) in the development of sampling and variational inference algorithms. This work introduces a novel algorithm, GflowMC, which relies on a Metropolis-within-Gibbs framework within the latent space of NFs. This approach addresses the challenge of vanishing acceptance probabilities often encountered when using NF-generated independent proposals, while retaining non-local updates, enhancing its suitability for sampling multi-modal distributions. We assess GflowMC's performance, concentrating on the $\phi^4$ model from statistical mechanics. Our results demonstrate that by identifying an optimal size for partial updates, convergence of the Markov Chain Monte Carlo (MCMC) can be achieved faster than with full updates. Additionally, we explore the adaptability of GflowMC for biasing proposals towards increasing the update frequency of critical coordinates, such as coordinates highly correlated to mode switching in multi-modal targets.
|
Christoph Schönle · Marylou Gabrié 🔗 |
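A Metropolis-within-Gibbs scheme with partial (block) updates can be sketched as follows; this toy uses a plain Gaussian independent proposal in place of a normalizing flow and a simple separable bimodal target, so it illustrates only the update structure, not GflowMC itself:

```python
import numpy as np

# Toy Metropolis-within-Gibbs with partial updates: refresh only a block of
# coordinates per step from an independent proposal. This keeps acceptance
# rates workable while still allowing non-local jumps between modes.
rng = np.random.default_rng(0)
dim, block = 4, 2                       # refresh 2 of 4 coordinates per step

def log_target(x):
    # Product of 1D bimodal densities: mixture of N(-2, 1) and N(2, 1).
    return np.sum(np.logaddexp(-0.5 * (x - 2.0) ** 2,
                               -0.5 * (x + 2.0) ** 2))

def log_q(b):
    # Independent N(0, 3^2) proposal density (up to a constant).
    return -0.5 * np.sum(b ** 2) / 9.0

x = np.zeros(dim)
samples = np.empty((5000, dim))
for step in range(samples.shape[0]):
    idx = rng.choice(dim, size=block, replace=False)
    prop = x.copy()
    prop[idx] = 3.0 * rng.normal(size=block)
    # Metropolis correction for the independent block proposal.
    log_a = log_target(prop) - log_target(x) + log_q(x[idx]) - log_q(prop[idx])
    if np.log(rng.uniform()) < log_a:
        x = prop
    samples[step] = x
```

With a full-dimensional independent proposal, the acceptance probability decays with dimension; shrinking the block size trades per-step movement for acceptance rate, which is the tuning knob the abstract refers to.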
-
|
Latent Neural PDE Solver for Time-dependent Systems
(
Poster
)
>
link
Neural networks have shown promising potential in accelerating the numerical simulation of systems governed by partial differential equations (PDEs). While many of the existing neural network surrogates operate on the high-dimensional discretized field, we propose to learn the dynamics of the system in the latent space with much coarser discretization. A non-linear autoencoder is trained first to project the full-order representation of the system onto the mesh-reduced space, then another temporal model is trained to predict the future state in this mesh-reduced space. This reduction process eases the training of the temporal model as it greatly reduces the computational cost induced by high-resolution discretization. We study the capability of the proposed framework on 2D/3D fluid flow and showcase that it has competitive performance compared to the model that operates on full-order space. |
Zijie Li · Saurabh Patil · Dule Shu · Amir Barati Farimani 🔗 |
-
|
Learning Temporal Higher-order Patterns to Detect Anomalous Brain Activity
(
Poster
)
>
link
Due to recent advances in machine learning on graphs, representing the connections of the human brain as a network has become one of the most pervasive analytical paradigms. However, most existing graph machine learning-based methods suffer from a subset of five critical limitations: They are (1) designed for simple pair-wise interactions, while recent studies on the human brain show the existence of higher-order dependencies of brain regions, (2) designed to operate on pre-constructed networks from time-series data, which limits their generalizability, (3) designed for classifying brain networks, limiting their ability to reveal underlying patterns that might cause the symptoms of a disease or disorder, (4) designed for learning static patterns, missing the dynamics of human brain activity, and (5) designed for supervised settings, making their performance reliant on the existence of labeled data. To address these limitations, we present HADiB, an end-to-end anomaly detection model that automatically learns the structure of the hypergraph representation of the brain from neuroimage data. HADiB uses a tetra-stage message-passing mechanism along with an attention mechanism that learns the importance of higher-order dependencies of brain regions. We further present a new adaptive hypergraph pooling to obtain brain-level representations, enabling HADiB to detect the neuroimages of people living with a specific disease or disorder. Our experiments on Parkinson’s Disease, Attention Deficit Hyperactivity Disorder, and Autism Spectrum Disorder show the efficiency and effectiveness of our approaches in detecting anomalous brain activity. |
Ali Behrouz · Farnoosh Hashemi 🔗 |
-
|
Expression Sampler as a Dynamic Benchmark for Symbolic Regression
(
Poster
)
>
link
Equation discovery, the problem of identifying mathematical expressions from data, has witnessed the emergence of symbolic regression (SR) techniques aided by benchmarking systems like SRbench. However, these systems are limited by their reliance on static expressions and datasets, which, in turn, provides limited insight into the circumstances under which SR algorithms perform well versus fail. To address this issue, we introduce an open-source method for generating comprehensive SR datasets via random sampling of mathematical expressions. This method enables dynamic expression sampling while controlling for various expression characteristics pertaining to expression complexity. The method also allows for using prior information about expression distributions, for example, to simulate expression distributions for a specific scientific domain. Using this dynamic benchmark, we demonstrate that the overall performance of established SR algorithms decreases with expression complexity and provide insight into which equation features are best recovered. Our results suggest that most SR algorithms overestimate the number of expression tree nodes and trigonometric functions and underestimate the number of input variables present in the ground truth. |
Ioana Marinescu · Younes Strittmatter · Chad Williams · Sebastian Musslick 🔗 |
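Random expression sampling with a complexity control (here, maximum tree depth) can be sketched in a few lines; the operator set, leaf probabilities, and depth limit are arbitrary illustrative choices, not the benchmark's actual configuration:

```python
import random

# Hypothetical sketch of dynamic expression sampling: draw random expression
# trees, controlling complexity via a maximum depth and a leaf probability.
OPS = {"+": 2, "*": 2, "sin": 1}   # operator -> arity (illustrative set)

def sample_expr(depth, max_depth, rng, n_vars=2):
    # At the depth limit (or with some probability), emit a leaf:
    # a variable or a random constant.
    if depth >= max_depth or rng.random() < 0.3:
        if rng.random() < 0.5:
            return f"x{rng.randrange(n_vars)}"
        return f"{rng.uniform(-2, 2):.2f}"
    op, arity = rng.choice(list(OPS.items()))
    args = [sample_expr(depth + 1, max_depth, rng) for _ in range(arity)]
    if arity == 1:
        return f"{op}({args[0]})"
    return f"({args[0]} {op} {args[1]})"

rng = random.Random(0)
exprs = [sample_expr(0, 4, rng) for _ in range(5)]
```

Biasing the operator table or the leaf probability toward a domain-specific distribution is one way to realize the "prior information about expression distributions" mentioned above.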
-
|
What a Scientific Language Model Knows and Doesn't Know about Chemistry
(
Poster
)
>
link
Large Language Models (LLMs) show promise to change how we can interact with and control the design of other modalities, such as drugs, materials, and proteins, and to enable scientific reasoning and planning. However, LLMs have several weaknesses: they tend to memorize instead of understand, and their implicit knowledge does not always propagate well between semantically similar inputs. In this work, we seek to distinguish what these scientific LLMs have memorized versus what they actually understand. To do so, we propose a new comprehensive benchmark dataset to evaluate LLM performance on molecular property prediction. We consider Galactica 1.3B, a state-of-the-art scientific LLM, and find that different prompting strategies exhibit vastly different error rates. We find that in-context learning generally improves performance over zero-shot prompting, and the effect is twice as great for computed properties as for experimental ones. Furthermore, we show the model is brittle and relies on memorized information, which may limit the application of LLMs for controlling molecular discovery. Based on these findings, we suggest the development of novel methods to enhance information propagation within LLMs: if we desire LLMs to help us control molecular design and the scientific process, then they must learn a sufficient understanding of how molecules work in the real world. |
Lawrence Zhao · Carl Edwards · Heng Ji 🔗 |
-
|
Beyond MD17: The xxMD Dataset as a Chemically Meaningful Benchmark for Neural Force Fields Development
(
Poster
)
>
link
Neural force fields (NFFs) have gained prominence in computational chemistry as surrogate models, superseding quantum-chemistry calculations in ab initio molecular dynamics. The prevalent benchmark for NFFs has been the MD17 dataset and its subsequent extension. These datasets predominantly comprise geometries from the equilibrium region of the ground electronic state potential energy surface, sampling from direct adiabatic dynamics. However, many chemical reactions entail significant molecular deformations, notably bond breaking. We demonstrate the constrained distribution of internal coordinates and energies in the MD17 datasets, underscoring their inadequacy for representing systems undergoing chemical reactions. Addressing this sampling limitation, we introduce the xxMD (Extended Excited-state Molecular Dynamics) dataset, derived from non-adiabatic dynamics. This dataset encompasses energies and forces ascertained from both multireference wave function theory and density functional theory. Furthermore, its nuclear configuration spaces authentically depict chemical reactions, making xxMD a more chemically relevant dataset. Our re-assessment of equivariant models on the xxMD datasets reveals notably higher mean absolute errors than those reported for MD17 and its variants. This observation underscores the challenges faced in crafting a generalizable NFF model with extrapolation capability. |
Zihan Pengmei · Junyu Liu · Yinan Shu 🔗 |
-
|
MUBen: Benchmarking the Uncertainty of Molecular Representation Models
(
Poster
)
>
link
Large molecular representation models pre-trained on massive unlabeled data have shown great success in predicting molecular properties. However, these models may tend to overfit the fine-tuning data, resulting in over-confident predictions on test data that fall outside of the training distribution. To address this issue, uncertainty quantification (UQ) methods can be used to improve the models' calibration of predictions. Although many UQ approaches exist, not all of them lead to improved performance. While some studies have included UQ to improve molecular pre-trained models, the process of selecting suitable backbone and UQ methods for reliable molecular uncertainty estimation remains underexplored. To address this gap, we present MUBen, which evaluates different UQ methods for state-of-the-art backbone molecular representation models to investigate their capabilities. By fine-tuning various backbones using different molecular descriptors as inputs with UQ methods from different categories, we critically assess the influence of architectural decisions and training strategies. Our study offers insights for selecting UQ for backbone models, which can facilitate research on uncertainty-critical applications in fields such as materials science and drug discovery. |
Yinghao Li · Lingkai Kong · Yuanqi Du · Yue Yu · Yuchen Zhuang · Wenhao Mu · Chao Zhang 🔗 |
-
|
Transformers are efficient hierarchical chemical graph learners
(
Poster
)
>
link
Transformers, adapted from natural language processing, are emerging as a leading approach for graph representation learning. Current graph transformers generally treat each node or edge as an individual token, which can become computationally expensive for graphs of even moderate size owing to the quadratic scaling of self-attention's computational complexity with token count. In this paper, we introduce SubFormer, a graph transformer that operates on subgraphs that aggregate information via a message-passing mechanism. This approach reduces the number of tokens and enhances the learning of long-range interactions. We demonstrate SubFormer on benchmarks for predicting molecular properties from chemical structures and show that it is competitive with state-of-the-art graph transformers at a fraction of the computational cost, with training times on the order of minutes on a consumer-grade graphics card. We interpret the attention weights in terms of chemical structures. We show that SubFormer exhibits limited over-smoothing and avoids the over-squashing that is prevalent in traditional graph neural networks. |
Zihan Pengmei · Zimu Li · Chih-chan Tien · Risi Kondor · Aaron Dinner 🔗 |
-
|
Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs
(
Poster
)
>
link
Understanding protein function is vital for drug discovery, disease diagnosis, and protein engineering. While Protein Language Models (PLMs) pre-trained on vast protein sequence datasets have achieved remarkable success, equivalent Protein Structure Models (PSMs) remain underrepresented. We attribute this to the relative lack of high-confidence structural data and suitable pre-training objectives. In this context, we introduce BioCLIP, a contrastive learning framework that pre-trains PSMs by leveraging PLMs, generating meaningful per-residue and per-chain structural representations. When evaluated on tasks such as protein-protein interaction, Gene Ontology annotation, and Enzyme Commission number prediction, BioCLIP-trained PSMs consistently outperform models trained from scratch and further enhance performance when merged with sequence embeddings. Notably, BioCLIP approaches, or exceeds, specialized methods across all benchmarks using its singular pre-trained design. Our work addresses the challenges of obtaining quality structural data and designing self-supervised objectives, setting the stage for more comprehensive models of protein function. Source code is publicly available. |
Louis Robinson · Timothy Atkinson · Liviu Copoiu · Patrick Bordes · Thomas PIERROT · Tom Barrett 🔗 |
-
|
Text2Decision: Decoding Latent Variables in Risky Decision Making from Think Aloud Text
(
Poster
)
>
link
Understanding human thoughts can be difficult, as scientists usually rely on observing behaviors. The think-aloud protocol, where people talk about their thoughts while making decisions, provides a more direct way to study thoughts. However, past research on this topic has mostly been qualitative. Recent advancements in artificial intelligence and natural language processing provide the potential for more quantitative analysis of language data. This study introduces Text2Decision, a model trained on task questions from a large-scale task collection, used to decode decision tendencies in risky decision-making from think-aloud texts. We test our model on both human and GPT-4-simulated think-aloud text about risky decision-making, which is out-of-distribution relative to the training data. Our findings demonstrate the model's performance in capturing GPT-4-manipulated decision personas and in unveiling heuristic decision tendencies from humans. Text2Decision demonstrates its capability by training on basic task outlines and theoretical frameworks and generalizing to unseen empirical think-aloud text data. This not only allows decoding individual differences from these texts but also extends to analyzing large-scale domain datasets. This study sheds light on AI integration in cognitive research for the AI4Science paradigm. |
Hanbo Xie · Huadong Xiong · Robert Wilson 🔗 |
-
|
SBMLtoODEjax: Efficient Simulation and Optimization of Biological Network Models in JAX
(
Poster
)
>
link
Advances in bioengineering and biomedicine demand a deep understanding of the dynamic behavior of biological systems, ranging from protein pathways to complex cellular processes. Biological networks like gene regulatory networks and protein pathways are key drivers of embryogenesis and physiological processes. Comprehending their diverse behaviors is essential for tackling diseases, including cancer, as well as for engineering novel biological constructs. Despite the availability of extensive mathematical models represented in Systems Biology Markup Language (SBML), researchers face significant challenges in exploring the full spectrum of behaviors and optimizing interventions to efficiently shape those behaviors. Existing tools designed for simulation of biological network models are not tailored to facilitate interventions on network dynamics nor to facilitate automated discovery. Leveraging recent developments in machine learning (ML), this paper introduces SBMLtoODEjax, a lightweight library designed to seamlessly integrate SBML models with ML-supported pipelines, powered by JAX. SBMLtoODEjax facilitates the reuse and customization of SBML-based models, harnessing JAX's capabilities for efficient parallel simulations and optimization, with the aim to accelerate research in biological network analysis. |
Mayalen Etcheverry · Michael Levin · Clément Moulin-Frier · Pierre-Yves Oudeyer 🔗 |
-
|
Reinforcement Learning-Enabled Environmentally Friendly and Multi-functional Chrome-looking Plating
(
Oral
)
>
link
Although decorative chrome plating (DCP) is ubiquitous in metal finishes and coatings, the industrial process of chromium deposition is fraught with adverse health effects for the workers involved and causes environmental pollution. In this work, we seek an environmentally friendly replacement for DCP by mimicking the chrome color used for decoration. To discover a suitable replacement efficiently, we employ a reinforcement learning (RL) algorithm to perform automatic inverse design of optical multilayer thin-film structures. The RL algorithm successfully identifies two different structures made of environmentally friendly materials that still show a chrome color. One structure is further designed to have high transmission in the radio-frequency regime, a property that ordinary metals cannot have, which can broaden decorative chrome applications to include microwave-operating devices. We also experimentally fabricate these structures and validate their performance. |
Taigao Ma · Anwesha Saha · L. Jay Guo · Haozhu Wang 🔗 |
-
|
CHARM: Creating Halos with Auto-Regressive Multi-stage networks
(
Poster
)
>
link
To maximize the amount of information extracted from cosmological datasets, simulations that accurately represent these observations are necessary. However, traditional simulations that evolve particles under gravity by estimating particle-particle interactions (N-body simulations) are computationally expensive and prohibitive to scale to the large volumes and resolutions necessary for the upcoming datasets. Moreover, modeling the distribution of galaxies typically involves identifying collapsed and bound dark matter structures called halos. This is also a time-consuming process for large N-body simulations, further exacerbating the computational cost. In this study, we introduce CHARM, a novel method for creating mock halo catalogs by matching the spatial and mass statistics of halos directly from the large-scale distribution of the dark matter density field. We develop multi-stage neural-spline-flow-based networks to learn this mapping directly from computationally cheaper, approximate dark matter simulations instead of relying on full N-body simulations. We validate that the mock halo catalogs have the same statistical properties as those obtained from traditional methods. Our method effectively provides a speed-up of more than a factor of 1000 in creating reliable mock halo catalogs compared to conventional approaches. This study represents a major first step towards analyzing the non-Gaussian and non-linear information from current-generation surveys using simulation-based inference approaches at the massive scales of upcoming surveys. |
Shivam Pandey · Chirag Modi · Benjamin Wandelt · Guilhem Lavaux 🔗 |
-
|
Unleashing the Autoconversion Rates Forecasting: Evidential Regression from Satellite Data
(
Poster
)
>
link
High-resolution simulations such as the ICOsahedral Non-hydrostatic Large-Eddy Model (ICON-LEM) can be used to understand the interactions between aerosols, clouds, and precipitation processes, which currently represent the largest source of uncertainty in determining the radiative forcing of climate change. Nevertheless, due to the exceptionally high computing cost required, this simulation-based approach can only be employed for short periods of time within a limited area. Although machine learning can mitigate this problem, the associated model uncertainties may make it less reliable. To address this, we developed a neural network (NN) model with evidential learning to assess both data and model uncertainties, applied to satellite observation data. Our study focuses on estimating the rate at which small cloud droplets collide and coalesce into larger raindrops (the autoconversion rate), since this is one of the key processes in precipitation formation in liquid clouds and hence crucial to better understanding cloud responses to anthropogenic aerosols. The results demonstrate that the model performs reasonably well, and the inclusion of both aleatoric and epistemic uncertainty estimation improves the credibility of the model and provides useful insights for future improvement. |
Maria Carolina Novitasari · Johannes Quaas · Miguel Rodrigues 🔗 |
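The evidential-learning component above can be made concrete with a short sketch. We assume the standard deep evidential regression formulation with a Normal-Inverse-Gamma evidence head (Amini et al., 2020); the abstract does not specify the exact variant, so the function names and parameter choices below are illustrative only:

```python
import numpy as np
from scipy.special import gammaln

def evidential_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood for deep evidential regression (sketch).

    The network outputs evidential parameters (gamma, nu, alpha, beta)
    per target instead of a point estimate; this is the marginal
    Student-t NLL of the observation y under that evidence.
    """
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * np.log(np.pi / nu)
            - alpha * np.log(omega)
            + (alpha + 0.5) * np.log(nu * (y - gamma) ** 2 + omega)
            + gammaln(alpha) - gammaln(alpha + 0.5))

def uncertainties(nu, alpha, beta):
    """Closed-form uncertainty split (valid for alpha > 1)."""
    aleatoric = beta / (alpha - 1.0)          # expected data noise
    epistemic = beta / (nu * (alpha - 1.0))   # model (evidence) uncertainty
    return aleatoric, epistemic
```

Both aleatoric and epistemic estimates fall out in closed form, which is what lets the paper attach credibility to each autoconversion-rate prediction without an ensemble.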
-
|
Role of Structural and Conformational Diversity for Machine Learning Potentials
(
Poster
)
>
link
In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically conformational and structural diversity, and model generalization is critical to improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed-budget experiment, in which the dataset size remains constant, and a fixed-molecular-set experiment, which fixes structural diversity while varying conformational diversity. Our results reveal nuanced patterns in generalization metrics. Notably, optimal structural and conformational generalization requires a careful balance between structural and conformational diversity, but existing QM datasets do not meet that trade-off. Additionally, our results highlight the limitations of MLIP models in generalizing beyond their training distribution, emphasizing the importance of defining the applicability domain during model deployment. These findings provide valuable insights and guidelines for QM data generation efforts. |
Nikhil Shenoy · Prudencio Tossou · Emmanuel Noutahi · Hadrien Mary · Dominique Beaini · Jiarui Ding 🔗 |
-
|
Immunology Meets Artificial Intelligence: Expanding Our Scientific Toolbox
(
Poster
)
>
link
Artificial intelligence (AI) is now a part of our daily lives. In this swiftly evolving landscape, AI has become an indispensable tool in the scientific discovery process, augmenting tasks from ideation and hypothesis generation to data cleaning, code development and debugging, text editing, and data analysis. This paper advocates for educational resources for AI in immunology, emphasizing its unique position to leverage AI's potential for scientific discovery. Immunology's intricate tapestry spans multiple biological scales, from molecular interactions to complex systems, presenting an ideal canvas for AI-driven solutions. The field is rich in data, thanks to advanced molecular and single-cell technologies, making it ripe for AI-driven insights. To support the intersection of AI and immunology, we've established a dedicated website as an AI resource hub, offering curated modules and resources. By fostering a "learn by playing" ethos, promoting interactive and engaging workshops, and inviting community contributions, we aim to empower immunologists to harness AI's transformative capabilities and navigate this exciting frontier collectively. |
Van Q. Truong · Matthew Lee · Dokyoon Kim · John Wherry · Marylyn Ritchie 🔗 |
-
|
SpatialSSL: Whole-Brain Spatial Transcriptomics in the Mouse Brain with Self-Supervised Learning
(
Poster
)
>
link
Self-supervised learning (SSL) is a rich framework for obtaining meaningful data representations across large datasets. While SSL shows impressive results in computer vision and natural language processing, its diverse applications in the single-cell field remain largely unexplored. We study SSL for cell classification in cellular neighborhoods of spatially resolved single-cell RNA-sequencing data. To this end, we developed an SSL framework for spatial molecular profiling data that integrates a cell's molecular expression and its spatial location within a tissue slice. We demonstrate our methods on a large-scale whole mouse brain atlas comprising gene expression measurements of 550 genes in 4,334,174 individual cells across 59 discrete tissue slices from the entire mouse brain. Our empirical study suggests that SSL improves downstream performance, especially in the presence of class imbalance. Notably, we observe a more substantial performance improvement at the sub-graph level than at the full-graph level. |
Till Richter · Anna Schaar · Francesca Drummer · Cheng-Wei Liao · Leopold Endres · Fabian Theis 🔗 |
-
|
Learning to Relax: Setting Solver Parameters Across a Sequence of Linear System Instances
(
Poster
)
>
link
Solving a linear system $\mathbf{Ax}=\mathbf{b}$ is a fundamental scientific computing primitive, and numerous solvers and preconditioners have been developed. These come with parameters whose optimal values depend on the system being solved and are often impossible or expensive to identify; thus in practice sub-optimal heuristics are used. We consider the common setting in which many related linear systems are solved, e.g. numerical simulation. In this scenario, can we sequentially choose parameters that attain a near-optimal overall number of iterations, without extra matrix computations? We answer in the affirmative for Successive Over-Relaxation (SOR), a standard solver whose parameter $\omega$ has a strong impact on its runtime. For this method, we prove that a bandit online learning algorithm—using only the number of iterations as feedback—can select parameters for a sequence of instances such that the overall cost approaches that of the best fixed $\omega$ as the sequence length increases. Furthermore, when given additional structural information, we show that a *contextual* bandit method asymptotically achieves the performance of the *instance-optimal* policy, which selects the best $\omega$ for each instance. Our work provides the first learning-theoretic treatment of high-precision linear system solvers and the first end-to-end guarantees for data-driven scientific computing, demonstrating theoretically the potential to speed up numerical methods using well-understood learning algorithms.
|
Misha Khodak · Edmond Chow · Maria-Florina Balcan · Ameet Talwalkar 🔗 |
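As a minimal illustration of why the relaxation parameter matters (a toy sketch of plain SOR on a standard test matrix, not the paper's bandit algorithm; the matrix choice and helper name are ours), the iteration count can swing by an order of magnitude with $\omega$, and that count is exactly the bandit feedback the paper uses:

```python
import numpy as np

def sor_iterations(A, b, omega, tol=1e-8, max_iter=10_000):
    """Count Successive Over-Relaxation iterations to reach relative tol."""
    n = len(b)
    x = np.zeros(n)
    b_norm = np.linalg.norm(b)
    for k in range(1, max_iter + 1):
        for i in range(n):
            # Gauss-Seidel update, over-relaxed by omega.
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
        if np.linalg.norm(A @ x - b) <= tol * b_norm:
            return k
    return max_iter

# 1-D discrete Laplacian: a classic SPD system on which SOR converges
# for any omega in (0, 2).
n = 30
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
counts = {omega: sor_iterations(A, b, omega) for omega in (1.0, 1.5, 1.9)}
```

Because the cost of a bad $\omega$ shows up only as extra iterations, a bandit that observes nothing but this count can steer toward the near-optimal choice across a sequence of related systems, which is the setting the paper analyzes.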
-
|
AstroCLIP: Cross-Modal Pre-Training for Astronomical Foundation Models
(
Poster
)
>
link
We present AstroCLIP, a strategy to facilitate the construction of astronomical foundation models that bridge the gap between diverse astronomical observational modalities. We demonstrate that a cross-modal contrastive learning approach between images and spectra of galaxies yields highly informative embeddings of both modalities. In particular, we apply our method to multi-band images and spectrograms from the Dark Energy Spectroscopic Instrument (DESI), and show that: (1) these embeddings are well-aligned between modalities and can be used for accurate cross-modal searches, and (2) these embeddings encode valuable physical information about the galaxies, in particular redshift and stellar mass, that can be used to achieve competitive zero- and few-shot predictions without further finetuning. Additionally, in the process of developing our approach, we also construct a novel transformer-based model and pretraining approach for galaxy spectra. |
Francois Lanusse · Liam Parker · Siavash Golkar · Alberto Bietti · Miles Cranmer · Michael Eickenberg · Geraud Krawezik · John McCabe · Ruben Ohana · Mariel Pettee · Bruno Régaldo-Saint Blancard · Tiberiu Tesileanu · Kyunghyun Cho · Shirley Ho
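The cross-modal contrastive objective described above follows the CLIP recipe. A minimal numpy sketch of the symmetric InfoNCE loss (our own simplification: a fixed temperature and pre-computed embeddings; the function name is illustrative) looks like:

```python
import numpy as np

def clip_loss(image_emb, spectrum_emb, temperature=0.07):
    """Symmetric contrastive loss between two modality embeddings.

    Matched image/spectrum pairs share a row index; they are pulled
    together while all other pairings are pushed apart.
    """
    # L2-normalize so dot products are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    spec = spectrum_emb / np.linalg.norm(spectrum_emb, axis=1, keepdims=True)
    logits = img @ spec.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(logits))              # diagonal = matched pairs

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->spectrum and spectrum->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss is what aligns the two embedding spaces, which in turn enables the cross-modal retrieval and the zero-/few-shot property prediction reported in the abstract.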
|
-
|
Modelling biology in novel ways - an AI-first course in Structural Bioinformatics
(
Poster
)
>
link
In recent years, there has been tremendous progress in applying data-driven methodologies to study biological questions. The rapidly evolving field of machine learning has produced a plethora of methods that can be applied to structural biology, such as protein structure prediction. However, the intricacies one faces when analyzing complex biological data are sometimes underappreciated in applications of machine learning methods. On the other hand, biologists often face a language and methods barrier when trying to understand and correctly apply machine learning tools. As a result, they might be using such methods without proper expertise, potentially resulting in incorrect predictions and questionable conclusions about the resulting data. To help remedy these issues, we have developed a holistic 11-unit course in AI-driven Structural Bioinformatics with the aim of (i) encouraging machine learning researchers to learn more about the biological complexity of the data they are analyzing and (ii) allowing biologists to better understand state-of-the-art machine learning algorithms for correct application to biological systems. The course includes video lectures, animated visualisations, as well as in-depth exercises and further resources for each of the topics discussed. We hope that our course stimulates collaboration across research communities and lowers the entry barrier for newcomers to understand and investigate structural biology with data-driven tools. Our course is available at \url{https://structural-bioinformatics.netlify.app}. |
Kieran Didi · Charles Harris · Pietro Lió · Rainer Beck 🔗 |
-
|
ChemGymRL: An Interactive Framework for Reinforcement Learning for Digital Chemistry
(
Poster
)
>
link
This paper provides a simulated laboratory for applying Reinforcement Learning (RL) to chemical discovery. Since RL is fairly data-intensive, training agents 'on-the-fly' by taking actions in the real world is infeasible and possibly dangerous. Moreover, chemical processing and discovery involve challenges that are not commonly found in RL benchmarks and therefore offer a rich space to work in. We introduce a set of highly customizable and open-source RL environments, \textbf{ChemGymRL}, implementing the standard Gymnasium API. ChemGymRL supports a series of interconnected virtual chemical \emph{benches} where RL agents can operate and train. The paper introduces and details each of these benches using well-known chemical reactions as illustrative examples, and trains a set of standard RL algorithms in each bench. Finally, we discuss and compare the performance of several standard RL methods, and provide a list of directions for future work as a vision for the further development and usage of ChemGymRL. |
Chris Beeler · Sriram Ganapathi · Colin Bellinger · Mark Crowley · Isaac Tamblyn 🔗 |
-
|
ExPT: Synthetic Pretraining for Few-Shot Experimental Design
(
Poster
)
>
link
Experimental design for optimizing black-box functions is a fundamental problem in many science and engineering fields. In this problem, sample efficiency is crucial due to the time, money, and safety costs of real-world design evaluations. Existing approaches either rely on active data collection or access to large, labeled datasets of past experiments, making them impractical in many real-world scenarios. In this work, we address the more challenging yet realistic setting of few-shot experimental design, where only a few labeled data points of input designs and their corresponding values are available. We introduce Experiment Pretrained Transformers (ExPT), a foundation model for few-shot experimental design that combines unsupervised learning and in-context pretraining. In ExPT, we only assume knowledge of a finite collection of unlabelled data points from the input domain and pretrain a transformer neural network to optimize diverse synthetic functions defined over this domain. Unsupervised pretraining allows ExPT to adapt to any design task at test time in an in-context fashion by conditioning on a few labeled data points from the target task and generating the candidate optima. We evaluate ExPT on few-shot experimental design in challenging domains and demonstrate its superior generality and performance compared to existing methods. The source code is available at https://github.com/tung-nd/ExPT.git. |
Tung Nguyen · Sudhanshu Agrawal · Aditya Grover 🔗 |
-
|
SE(3)-Invariant Multiparameter Persistent Homology for Chiral-Sensitive Molecular Property Prediction
(
Poster
)
>
link
In this study, we present a novel computational method for generating molecular fingerprints using multiparameter persistent homology (MPPH). This technique holds considerable significance for key areas such as drug discovery and materials science, where precise molecular property prediction is vital. By integrating SE(3)-invariance with Vietoris-Rips persistent homology, we effectively capture the three-dimensional representations of molecular chirality. Chirality, an intrinsic feature of stereochemistry, is dictated by the spatial orientation of atoms within a molecule, defining its unique 3D configuration. This non-superimposable mirror-image property directly influences molecular interactions, thereby serving as an essential factor in molecular property prediction. We explore the underlying topologies and patterns in molecular structures by applying Vietoris-Rips persistent homology across varying scales and parameters such as atomic weight, partial charge, bond type, and chirality. Our method's efficacy can be further improved by incorporating additional parameters such as aromaticity, orbital hybridization, bond polarity, conjugated systems, as well as bond and torsion angles. Additionally, we leverage Stochastic Gradient Langevin Boosting (SGLB) in a Bayesian ensemble of Gradient Boosting Decision Trees (GBDT) to obtain aleatoric and epistemic uncertainty estimates for gradient boosting models. Using these uncertainty estimates, we prioritize high-uncertainty samples for active learning and model fine-tuning, benefiting scenarios where data labeling is costly or time-consuming. Our approach offers unique insights into molecular structure, distinguishing it from traditional single-parameter or single-scale analyses. Compared to conventional graph neural networks (GNNs), which often suffer from oversmoothing and oversquashing, MPPH provides a more comprehensive and interpretable characterization of molecular data topology.
We substantiate our approach with theoretical stability guarantees and demonstrate its superior performance over existing state-of-the-art methods in predicting molecular properties through extensive evaluations on the MoleculeNet benchmark datasets. |
Andaç Demir · Francis Prael III · Bulent Kiziltan 🔗 |
-
|
Self-supervised Learning to Discover Physical Objects and Predict Their Interactions from Raw Videos
(
Poster
)
>
link
The ability to discover objects from raw videos and to predict their future dynamics is crucial for achieving general intelligence. While existing methods accomplish these two tasks separately, i.e., learning object segmentation with fixed dynamics or learning dynamics with known system states, we explore the feasibility of jointly accomplishing the two together in a self-supervised setting for physical environments. Critically, we show on real video datasets that learning object dynamics improves the accuracy of discovering dynamical objects. |
Sheng Cheng · 'YZ' Yezhou Yang · Yang Jiao · Yi Ren 🔗 |
-
|
Deep Bayesian Experimental Design for Quantum Many-Body Systems
(
Poster
)
>
link
Bayesian experimental design is a technique for efficiently selecting measurements to characterize a physical system by maximizing the expected information gain. Recent developments in deep neural networks and normalizing flows allow for a more efficient approximation of the posterior and thus the extension of this technique to complex high-dimensional situations. In this paper, we show how this approach holds promise for adaptive measurement strategies to characterize present-day quantum technology platforms. In particular, we focus on arrays of coupled cavities and qubit arrays. Both represent model systems of high relevance for modern applications, such as quantum simulation and computing, and both have been realized in platforms where measurement and control can be exploited to characterize and counteract unavoidable disorder. Thus, they represent ideal targets for applications of Bayesian experimental design. |
Leopoldo Sarra · Florian Marquardt 🔗 |
-
|
Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers
(
Poster
)
>
link
In recent years, significant progress has been made in the field of protein function prediction with the development of various machine-learning approaches. However, most existing methods formulate the task as a multi-class classification problem, i.e., assigning predefined labels to proteins. In this work, we propose a novel approach, Prot2Text, which predicts a protein's function in a free-text style, moving beyond conventional binary or categorical classifications. By combining Graph Neural Networks (GNNs) and Large Language Models (LLMs) in an encoder-decoder framework, our model effectively integrates diverse data types, including protein sequence, structure, and textual annotation and description. This multimodal approach allows for a holistic representation of proteins' functions, enabling the generation of detailed and accurate functional descriptions. To evaluate our model, we extracted a multimodal protein dataset from SwissProt and demonstrate empirically the effectiveness of Prot2Text. These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate function prediction of existing as well as previously unseen proteins. |
Hadi Abdine · Michail Chatzianastasis · Costas Bouyioukos · Michalis Vazirgiannis 🔗 |
-
|
Evaluating the structure of cognitive tasks with transfer learning
(
Poster
)
>
link
Electroencephalography (EEG) decoding is a challenging task due to the limited availability of labelled data. While transfer learning is a promising technique to address this challenge, it assumes that transferable data domains and tasks are known, which is not the case in this setting. This study investigates the transferability of deep learning representations between different EEG decoding tasks. We conduct extensive experiments using state-of-the-art decoding models on two recently released EEG datasets, ERP Core and M3CV, containing over 140 subjects and 11 distinct cognitive tasks. We measure the transferability of learned representations by pre-training deep neural networks on one task and assessing their ability to decode subsequent tasks. Our experiments demonstrate that, even with linear probing transfer, significant improvements in decoding performance can be obtained, with gains of up to 28% compared with the pure supervised approach. Additionally, we discover evidence that certain decoding paradigms elicit specific and narrow brain activities, while others benefit from pre-training on a broad range of representations. By revealing which tasks transfer well and demonstrating the benefits of transfer learning for EEG decoding, our findings have practical implications for mitigating data scarcity in this setting. The transfer maps generated also provide insights into the hierarchical relations between cognitive tasks, hence enhancing our understanding of how these tasks are connected from a neuroscientific standpoint. |
Bruno Aristimunha · Raphael de Camargo · Walter Lopez Pinaya · Sylvain Chevallier · Alexandre Gramfort · Cédric ROMMEL 🔗 |
-
|
Latent Space Simulator for Unveiling Molecular Free Energy Landscapes and Predicting Transition Dynamics
(
Poster
)
>
link
Free Energy Surfaces (FES) and metastable transition rates are key to understanding the behavior of molecules within a system. However, typical approaches require computing force fields across billions of time steps in a molecular dynamics (MD) simulation, which is often intractable for large systems or databases. In this work, we propose LaMoDy, a latent-space MD simulator that tackles this intractability, achieving around 20-fold speed improvements over classical MD. The model leverages a chirality-aware SE(3)-invariant encoder-decoder architecture to generate a latent space, coupled with a recurrent neural network to run the time-wise dynamics. We show that LaMoDy recovers realistic trajectories and FES more accurately and faster than existing methods while capturing their major dynamical and conformational properties. Furthermore, the proposed approach can generalize to molecules outside the training distribution. |
Simon Dobers · Hannes Stärk · Xiang Fu · Dominique Beaini · Stephan Günnemann 🔗 |
-
|
Bayesian Machine Scientist for Model Discovery in Psychology
(
Poster
)
>
link
The rapid growth in complex datasets within the field of psychology poses challenges for integrating observations into quantitative models of human information processing. Other fields of research, such as physics, have proposed equation discovery techniques as a way of automating data-driven discovery of interpretable models. One such approach is the Bayesian Machine Scientist (BMS), which employs Bayesian inference to derive mathematical equations linking input variables to an output variable. While BMS has shown promise, its application has been limited to a small subset of scientific domains. This study examines the utility of BMS for model discovery in psychology. In Experiment 1, we compare BMS in recovering four models of human information processing against two common psychological benchmark models (linear/logit regression and a black-box neural network) across a spectrum of noise levels. BMS outperformed the benchmark models at the majority of noise levels and demonstrated at least equivalent performance at higher levels of noise. These findings demonstrate BMS's potential for discovering psychological models of human information processing. In Experiment 2, we investigated the impact of informed priors on BMS recovery, comparing domain-specific function priors against a benchmark uniform prior. Specifically, we investigated four priors from research domains of varying specificity to psychology. We observe that informed priors robustly enhanced BMS performance for only one of the four models of human information processing. In summary, our findings demonstrate the effectiveness of BMS in recovering computational models of human information processing across a range of noise levels; however, whether integrating expert knowledge into the BMS framework improves performance remains a subject of further inquiry. |
Joshua Hewson · Younes Strittmatter · Ioana Marinescu · Chad Williams · Sebastian Musslick 🔗 |
-
|
Representing Core-collapse Supernova Light Curves Analytically with Symbolic Regression
(
Poster
)
>
link
In anticipation of new astrophysical surveys, such as the upcoming Legacy Survey of Space and Time conducted by the Vera C. Rubin Observatory, machine learning techniques are increasingly used to rapidly classify transient events in the night sky. Most often, deep-learning-based methods rely on unphysical and uninterpretable representations of astrophysical data. In this work, we use symbolic regression to derive an analytic expression for the luminosity of the most common core-collapse supernova (the explosive death of a massive star) as a function of time and physical parameters; such an analytical expression has eluded the literature for a century. This expression is learned from a set of simulated bolometric light curves (measured luminosity as a function of time) generated from six input physical parameters. We find that a single analytic expression can adequately reproduce ~70% of the light curves in our dataset; we additionally present a small set of analytical expressions that reproduce the full set of light curves. By deriving an analytic relation between physical parameters and light curve fluxes, we create (for the first time) an interpretable parametric model and circumvent the computationally expensive integrations used to simulate the original dataset. This work demonstrates promising preliminary results for future efforts to make machine learning techniques in astronomy more transparent and interpretable. |
Kaylee de Soto · V Villar 🔗 |
-
|
Re-evaluating Retrosynthesis Algorithms with Syntheseus
(
Poster
)
>
link
The planning of how to synthesize molecules, also known as retrosynthesis, has been a growing focus of the machine learning and chemistry communities in recent years. Despite the appearance of steady progress, we argue that imperfect benchmarks and inconsistent comparisons mask systematic shortcomings of existing techniques. To remedy this, we present a benchmarking library called syntheseus which promotes best practice by default, enabling consistent meaningful evaluation of single-step and multi-step retrosynthesis algorithms. We use syntheseus to re-evaluate a number of previous retrosynthesis algorithms, and find that the ranking of state-of-the-art models changes when evaluated carefully. We end with guidance for future works in this area. |
Krzysztof Maziarz · Austin Tripp · Austin Tripp · Guoqing Liu · Guoqing Liu · Megan J Stanley · Megan J Stanley · Shufang Xie · Shufang Xie · Piotr Gaiński · Piotr Gaiński · Philipp Seidl · Philipp Seidl · Marwin Segler · Marwin Segler 🔗 |
-
|
Sensitivity Analysis of Simulation-Based Inference for Galaxy Clustering
(
Poster
)
>
link
Simulation-based inference (SBI) is a promising approach to leverage high-fidelity cosmological simulations and extract information from the non-Gaussian, non-linear scales that cannot be modeled analytically. However, scaling SBI to the next generation of cosmological surveys faces the computational challenge of requiring a large number of accurate simulations over a wide range of cosmologies, while simultaneously encompassing large cosmological volumes at high resolution. This challenge can potentially be mitigated by balancing the accuracy and computational cost of the different component models of the simulations while ensuring robust inference. To guide our steps, we perform a sensitivity analysis of SBI for galaxy clustering with respect to the main components of the cosmological simulations: the gravity model, the halo finder, and the galaxy-halo distribution models. We infer two main cosmological parameters using galaxy power spectrum multipoles (two-point statistics) and the bispectrum monopole (three-point statistics), assuming a galaxy number density expected from the current generation of galaxy surveys. We find that SBI is insensitive to changing the gravity model between accurate but slow $N$-body simulations and fast but approximate particle-mesh simulations. However, changing the method used to identify the collapsed dark matter structures (halos) that galaxies populate can lead to biased cosmological inferences. For models of how galaxies populate these halos, training SBI on a more complex model leads to consistent inference for less complex models, but SBI trained on simpler models fails when applied to data from a more complex model.
|
Shivam Pandey · Chirag Modi · Benjamin Wandelt · Matthew Ho · ChangHoon Hahn · Bruno Régaldo-Saint Blancard 🔗 |
-
|
Learning Scalar Fields for Molecular Docking with Fast Fourier Transforms
(
Poster
)
>
link
Molecular docking is critical to structure-based virtual screening, yet the throughput of such workflows is limited by the expensive optimization of scoring functions involved in most docking algorithms. We explore how machine learning can accelerate this process by learning a scoring function with a functional form that allows for more rapid optimization. Specifically, we define the scoring function to be the cross-correlation of multi-channel ligand and protein scalar fields parameterized by equivariant graph neural networks, enabling rapid optimization over rigid-body degrees of freedom with fast Fourier transforms. Moreover, the runtime of our approach can be amortized at several levels of abstraction, and is particularly favorable for virtual screening settings with a common binding pocket. We benchmark our scoring functions on two simplified docking-related tasks: decoy pose scoring and rigid conformer docking. Our method attains similar but faster performance on crystal structures compared to the Vina and Gnina scoring functions, and is more robust on computationally predicted structures. |
Bowen Jing · Tommi Jaakkola · Bonnie Berger 🔗 |
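The key speed trick above, scoring every translational pose at once with a pair of FFTs via the cross-correlation theorem, can be sketched on a single-channel grid (a simplification of ours; the actual model uses learned multi-channel fields from equivariant GNNs and also sweeps rotations):

```python
import numpy as np

def fft_translation_scan(protein_field, ligand_field):
    """Score all circular translations of a ligand field against a protein field.

    Cross-correlation theorem: corr(f, g)[t] = IFFT(FFT(f) * conj(FFT(g)))[t],
    so every translational pose is scored in O(N log N) instead of one
    optimization per pose.
    """
    F = np.fft.fftn(protein_field)
    G = np.fft.fftn(ligand_field)
    scores = np.fft.ifftn(F * np.conj(G)).real   # score for each shift
    best = np.unravel_index(np.argmax(scores), scores.shape)
    return scores, best
```

Because the protein field can be transformed once and reused against many ligands, the cost amortizes naturally in virtual screening against a common binding pocket, which is the setting the abstract highlights.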
-
|
Hypothesis Tests for Distributional Group Symmetry with Applications to Particle Physics
(
Poster
)
>
link
Symmetry plays a central role in the sciences, machine learning, and statistics. When data are known to obey a symmetry, various methods that exploit symmetry have been developed. However, statistical tests for the presence of group invariance largely focus on a handful of specialized situations, and tests for equivariance are largely non-existent. This work formulates non-parametric hypothesis tests, based on a single independent and identically distributed sample, for distributional symmetry under a specified group. We provide a general formulation of tests for symmetry within two broad settings. The first setting tests for the invariance of a marginal or joint distribution under the action of a compact group. We propose a conditional Monte Carlo test that achieves exact $p$-values with finitely many observations and Monte Carlo samples. The second setting tests for the invariance or equivariance of a conditional distribution under the action of a locally compact group. We show that the test for conditional symmetry can be formulated as a test of conditional independence. We implement our tests using kernel methods and apply them to testing for symmetry in problems from high-energy particle physics.
|
Kenny Chiu · Benjamin Bloem-Reddy 🔗 |
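A toy instance of the compact-group invariance test described above (the group, statistic, and function name are our simplifications, not the paper's general construction): under the null, transforming each observation by an independent random group element leaves the joint distribution unchanged, so the observed statistic is exchangeable with its randomized copies.

```python
import numpy as np

def rotation_invariance_pvalue(x, n_mc=999, rng=None):
    """Monte Carlo p-value for SO(2)-invariance of a 2-D sample.

    Statistic: norm of the sample mean (zero in expectation for a
    rotation-invariant distribution). Each Monte Carlo replicate rotates
    every point by an independent uniform angle; exchangeability under
    the null makes the resulting p-value valid with finitely many samples.
    """
    rng = np.random.default_rng(rng)
    stat = lambda pts: np.linalg.norm(pts.mean(axis=0))
    t_obs = stat(x)
    exceed = 0
    for _ in range(n_mc):
        theta = rng.uniform(0.0, 2.0 * np.pi, size=len(x))
        c, s = np.cos(theta), np.sin(theta)
        rotated = np.stack([c * x[:, 0] - s * x[:, 1],
                            s * x[:, 0] + c * x[:, 1]], axis=1)
        exceed += stat(rotated) >= t_obs
    return (1 + exceed) / (1 + n_mc)
```

A sample centered away from the origin breaks rotation invariance and yields a small p-value, while an isotropic sample does not; the paper generalizes this recipe to arbitrary compact groups and to conditional (equivariance) settings.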
-
|
Unveiling the Secrets of $^1$H-NMR Spectroscopy: A Novel Approach Utilizing Attention Mechanisms
(
Poster
)
>
link
The significance of Nuclear Magnetic Resonance (NMR) spectroscopy in organic synthesis cannot be overstated, as it plays a pivotal role in deducing chemical structures from experimental data. While machine learning has predominantly been employed for predictive purposes in the analysis of spectral data, our study introduces a novel application of a transformer-based model's attention weights to unravel the underlying "language" that correlates spectral peaks with their corresponding atoms in the chemical structures. This attention-mapping technique proves beneficial for comprehending spectra, enabling reliable differentiation between product $^1$H-NMR spectra and reactant spectra extracted from experimental data with an accuracy exceeding 95\%. Furthermore, it consistently associates peaks with the correct atoms in the molecule, achieving a remarkable peak-to-atom match rate of 71\% for exact matching and 89\% for close shift matching ($\pm$0.59 ppm). This framework exemplifies the capability of harnessing the attention mechanism within transformer models to unveil the intricacies of spectroscopic data. Importantly, this approach can readily be extended to other types of spectra, showcasing its versatility and potential for broader applications in the field.
|
Oliver Schilter · Marvin Alberts · Federico Zipoli · Alain C. Vaucher · Philippe Schwaller · Teodoro Laino 🔗 |
-
|
Vertical AI-driven Scientific Discovery
(
Poster
)
>
link
Automating scientific discovery has been a grand goal of Artificial Intelligence (AI) and will bring tremendous societal impact if it succeeds. Despite exciting progress, most endeavors in learning scientific equations from experiment data focus on horizontal discovery paths, i.e., they directly search for the best equation in the full hypothesis space. Horizontal paths are challenging because of the associated exponentially large search space. Our work explores an alternative vertical path, which builds scientific equations incrementally, starting from one that models data in control variable experiments in which most variables are held as constants. It then extends expressions learned in previous generations by adding new independent variables, using new control variable experiments in which these variables are allowed to vary. This vertical path is motivated by human scientific discovery processes. Experimentally, we demonstrate that such vertical discovery paths expedite symbolic regression. The approach also improves the learning of physics models describing nano-structure evolution in computational materials science. |
Yexiang Xue 🔗 |
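The vertical path can be sketched on a toy law $y = 2x_1x_2$ (the law, sample sizes, and ranges below are invented for illustration): first fit a reduced equation with $x_2$ controlled, then extend it by modeling how the fitted coefficient varies once $x_2$ is allowed to change:

```python
import numpy as np

# Ground-truth law (unknown to the "scientist"): y = 2 * x1 * x2
def experiment(x1, x2):
    return 2.0 * x1 * x2

rng = np.random.default_rng(0)

# Stage 1: control-variable experiment with x2 held constant.
# The reduced hypothesis space is just y = a * x1.
x2_fixed = 3.0
x1 = rng.uniform(1, 5, 50)
y = experiment(x1, x2_fixed)
a_hat = np.linalg.lstsq(x1[:, None], y, rcond=None)[0][0]   # a_hat ≈ 2 * x2_fixed

# Stage 2: let x2 vary across new controlled experiments and model how the
# fitted coefficient depends on it, extending y = a*x1 to y = b*x1*x2.
a_vals, x2_vals = [], np.linspace(1, 4, 10)
for c in x2_vals:
    x1 = rng.uniform(1, 5, 50)
    a_vals.append(np.linalg.lstsq(x1[:, None], experiment(x1, c), rcond=None)[0][0])
b_hat = np.linalg.lstsq(np.asarray(x2_vals)[:, None],
                        np.asarray(a_vals), rcond=None)[0][0]
```

Each stage searches a much smaller hypothesis space than a direct (horizontal) search over all equations in both variables.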
-
|
Scalable Multimer Structure Prediction using Diffusion Models
(
Poster
)
>
link
Accurate protein complex structure modeling is a necessary step in understanding the behavior of biological pathways and cellular systems. While some works have attempted to address this challenge, there is still a need for scaling existing methods to larger protein complexes. To address this need, we propose a novel diffusion generative model (DGM) that predicts large multimeric protein structures by learning to rigidly dock its chains together. Additionally, we construct a new dataset specifically for large protein complexes used to train and evaluate our DGM. We substantially improve prediction runtime and completion rates while maintaining competitive accuracy with current methods. |
Peter Pao-Huang · Bowen Jing · Bonnie Berger 🔗 |
-
|
AlphaFold Meets Flow Matching for Generating Protein Ensembles
(
Poster
)
>
link
The significant success of AlphaFold2 at protein structure prediction has pointed to structural ensembles as the next frontier towards a more complete computational understanding of protein structure. At the same time, iterative refinement-based techniques such as diffusion have driven significant breakthroughs in generative modeling. We explore the synergy of these developments by combining highly accurate protein structure prediction models with flow matching, a powerful modern generative modeling framework, in order to sample the conformational landscape of proteins. Preliminary results on membrane transporters, ligand-induced conformational change, and disordered ensembles show the potential of the approach. Importantly, and unlike MSA-based methods, our method also obtains similar distributions even when used with language model-based algorithms such as ESMFold, which are otherwise deterministic given an input sequence. These results open exciting avenues in the computational prediction of conformational flexibility. |
Bowen Jing · Bonnie Berger · Tommi Jaakkola 🔗 |
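The flow-matching ingredient can be sketched independently of any structure predictor: sample a point on the straight-line path between a prior draw and a data draw, and regress a velocity field onto the path's time derivative. The toy arrays below are illustrative; the paper's method conditions a full structure model, which this sketch omits:

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear interpolation path and its conditional velocity target.

    x_t = (1 - t) * x0 + t * x1,  u_t = d x_t / d t = x1 - x0.
    A velocity network v(x_t, t) trained to regress u_t transports samples
    of the prior (x0) onto the data distribution (x1) by ODE integration.
    """
    xt = (1.0 - t) * x0 + t * x1
    ut = x1 - x0
    return xt, ut

rng = np.random.default_rng(0)
x0 = rng.normal(size=(128, 3))      # prior sample (e.g. Gaussian noise)
x1 = x0 + 2.0                       # stand-in for "data" conformations
t = rng.uniform(size=(128, 1))
xt, ut = flow_matching_pair(x0, x1, t)
loss = np.mean((ut - 2.0) ** 2)     # a perfect v would regress u_t exactly
```

At $t=0$ the path sits at the prior sample and at $t=1$ at the data sample, so integrating the learned velocity field from $t=0$ to $1$ generates ensemble members.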
-
|
ComboPath: A model for predicting drug combination effects
(
Poster
)
>
link
Drug combinations have been shown to be an effective strategy for cancer therapy, but identifying beneficial combinations through experiments is labor-intensive and expensive. Machine learning (ML) systems that can propose novel and effective drug combinations have the potential to dramatically improve the efficiency of combinatoric drug design. However, the biophysical parameters of drug combinations are degenerate, making it challenging to identify the ground truth of drug interactions even given high-quality experimental data. Existing ML models are highly underspecified to meet this challenge, leaving them vulnerable to producing parameters that are not biophysically realistic and harming generalization. We have developed a new ML model, ``ComboPath,'' to predict the cellular dose-response surface of a two-drug combination based on each drug's interactions with their known protein targets. ComboPath incorporates a biophysically-motivated intermediate parameterization with prior information used to improve model specification. This is the first ML model to nominate beneficial drug combinations while simultaneously reconstructing the dose-response surface, providing insight into both the potential of a drug combination and its optimal dosing for therapeutic development. We show that our models were able to accurately reconstruct 2D dose response surfaces across held-out combination samples from the largest available combinatoric screening dataset while substantially improving model specification for key biophysical parameters. |
Duminda Ranasinghe · Changchang Liu · Daniel Spitz · Hok Hei Tam · Nathan Sanders 🔗 |
-
|
ORDerly: Datasets and benchmarks for chemical reaction data
(
Poster
)
>
link
Machine learning has the potential to provide tremendous value to the life sciences by providing models that aid in the discovery of new molecules and reduce the time for new products to come to market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction datasets for training ML models. Herein, we present ORDerly, an open-source Python package for customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean US patent data stored in ORD and generate datasets for forward prediction, retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks on datasets generated with ORDerly for condition prediction and show that datasets missing key cleaning steps can lead to silently overinflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalisation. By providing a customizable open-source solution for cleaning and preparing large chemical reaction data, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry. |
Daniel Wigh · Joe Arrowsmith · Alexander Pomberger · Kobi Felton · Alexei Lapkin 🔗 |
-
|
Stochastic force inference via density estimation
(
Poster
)
>
link
Inferring dynamical models from low-resolution temporal data continues to be a significant challenge in biophysics, especially within transcriptomics, where separating molecular programs from noise remains an important open problem. We explore a common scenario in which we have access to an adequate number of cross-sectional samples at a few time points, and assume that our samples are generated from a latent diffusion process. We propose an approach that relies on the probability flow associated with an underlying diffusion process to infer an autonomous, nonlinear force field interpolating between the distributions. Given a prior on the noise model, we employ score-matching to differentiate the force field from the intrinsic noise. Using relevant biophysical examples, we demonstrate that our approach can extract non-conservative forces from non-stationary data, that it learns equilibrium dynamics when applied to steady-state data, and that it can do so with both additive and multiplicative noise models. |
Victor Chardès · Suryanarayana Maddu · Michael Shelley 🔗 |
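A plug-in version of the score ($\nabla \log p$) estimation that underlies probability-flow approaches can be sketched with a Gaussian kernel density estimate; this is a generic illustration, not the authors' estimator:

```python
import numpy as np

def kde_score(x_query, samples, h=0.3):
    """Score (d/dx log p) of a 1-D Gaussian kernel density estimate.

    For p_h(x) = mean_i N(x; x_i, h^2), the gradient of log p_h is a
    softmax-weighted average of (x_i - x) / h^2 -- a simple plug-in score
    estimator of the kind that drives probability-flow methods.
    """
    d2 = -0.5 * ((x_query - samples) / h) ** 2
    w = np.exp(d2 - d2.max())           # numerically stable softmax weights
    w = w / w.sum()
    return float(np.sum(w * (samples - x_query)) / h**2)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 5000)
# For N(0,1) the true score is -x: negative right of the mean, positive left.
s_right = kde_score(2.0, samples)
s_left = kde_score(-2.0, samples)
```

In the paper's setting the score is learned by score matching rather than read off a KDE, but the role it plays, pointing samples back toward high-density regions, is the same.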
-
|
Mitigating Bias in Scientific Data: a Materials Science Case Study
(
Poster
)
>
link
Growing scientific data and data-driven informatics drastically promote scientific discovery. While there are significant advancements in data-driven models, the quality of data resources is less studied despite its huge impact on model performance. As an example, we focus on data bias arising from uneven coverage of materials families in existing knowledge. Observing different diversities among crystal systems in common materials databases, we propose an information entropy-based metric for measuring this bias. To mitigate the bias, we develop an entropy-targeted active learning (ET-AL) framework, which guides the acquisition of new data to improve the diversity of underrepresented crystal systems. We demonstrate the capability of ET-AL for bias mitigation and the resulting improvement in downstream machine learning models. This approach is broadly applicable to data-driven materials discovery, including autonomous data acquisition and dataset trimming to reduce bias, as well as data-driven informatics in other scientific domains. |
Hengrui Zhang · Wei Chen · James Rondinelli · Wei Chen 🔗 |
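The entropy idea behind the bias metric can be sketched as the Shannon entropy of dataset coverage across the seven crystal systems. The counts below are made up, and the paper's ET-AL metric is defined over structure diversity rather than raw counts, so this shows only the entropy building block:

```python
import numpy as np

def structure_entropy(counts):
    """Shannon entropy (bits) of the distribution of dataset entries over
    categories (here: crystal systems). Higher entropy means more even
    coverage; lower entropy signals bias toward a few categories."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # 0 * log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

# Hypothetical entry counts over the 7 crystal systems.
balanced = [100] * 7
skewed = [640, 10, 10, 10, 10, 10, 10]
```

An entropy-targeted acquisition rule would then prefer new samples from the underrepresented categories, pushing the entropy of `skewed` toward the `balanced` maximum of $\log_2 7$ bits.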
-
|
Mapping the intermolecular interaction universe through self-supervised learning on molecular crystals
(
Poster
)
>
link
Molecular interactions fundamentally influence all aspects of chemistry and biology. Prevailing machine learning approaches emphasize the modeling of molecules in isolation or at best provide limited modeling of molecular interactions, typically restricted to protein-ligand and protein-protein interactions. Here, we present how to use molecular crystals to define the MolInteractDB dataset that contains valuable biochemical knowledge, which can be captured by large self-supervised pre-trained models. MolInteractDB incorporates 344,858 molecular crystal structure entries from the Cambridge Structural Database. We formulate entries in the MolInteractDB dataset as radial patches of flexible size and at varying positions in the crystal to represent intermolecular interactions across crystal structures. We characterize a variety of interactions highlighted across 6 million patches. Leveraging MolInteractDB, we develop InteractNN, a self-supervised SE(3)-equivariant 3D message passing network. We show that InteractNN captures the latent knowledge of chemical elements as well as intermolecular interaction types at a scale not directly accessible to human scientists. To demonstrate its potential, we fine-tuned InteractNN to predict the binding affinity between proteins and ligands, producing results comparable with state-of-the-art models. |
Ada Fang · ZAIXI ZHANG · Marinka Zitnik 🔗 |
-
|
ATAT: Automated Tissue Alignment and Traversal
(
Poster
)
>
link
The spatial geometry of tissue biopsies reveals complex landscapes of cellular interactions. With the advent of spatial transcriptomics (ST), the ability to measure RNA across these intricate terrains has significantly advanced. However, without a pathologist’s insight to delineate regions of interest, modeling gene expression transitions across specific regions becomes a daunting task. A case in point is grading the severity of inflammatory bowel disease (IBD) across the intestinal wall while identifying the organization of immune cell types across the tissue layers; such characterization will be essential in the push for precision medicine. Yet the challenge to harness ST data to decipher spatially dependent transcriptional programs in a scalable and automated manner remains a well acknowledged barrier to wider implementation. Our study aims to: (1) Utilize hematoxylin and eosin (H\&E) stained images for automated segmentation of histological regions and (2) Simulate the gene expression transition across these histological layers within a single algorithmic framework. To these ends, we present ATAT: Automated Tissue Alignment and Traversal. With our approach, we automate the integration of H\&E stained images with spatial transcriptomics and simplify the investigation of important biomedical questions, such as characterization of inflammatory conditions across intestinal walls. |
Steven Song · Emaan Mohsin · Andrey Kuznetsov · Christopher Weber · Robert Grossman · Aly Khan 🔗 |
-
|
Van der Pol-informed Neural Networks for Multi-step-ahead Forecasting of Extreme Climatic Events
(
Poster
)
>
link
Deep learning has produced excellent results in several applied domains including computer vision, natural language processing, speech recognition, etc. Physics-informed neural networks (PINN) are a new family of deep learning models that combine prior knowledge of physics in the form of high-level abstraction of natural phenomena with data-driven neural networks. PINN has emerged as a flourishing area of scientific computing to deal with the challenges of shortage of training data, enhancing physical plausibility, and specifically aiming to solve complex differential equations. However, building PINNs for modeling and forecasting the dynamics of extreme climatic events of geophysical systems remains an open scientific problem. This study proposes Van der Pol-informed Neural Networks (VPINN), a physics-informed differential learning approach, for modeling extreme nonlinear dynamical systems such as climatic events, exploiting the physical differentials as the physics-derived loss function. Our proposal is compared to state-of-the-art time series forecasting models, showing superior performance. |
Anurag Dutta · Madhurima Panja · Uttam Kumar · Chittaranjan Hens · Tanujit Chakraborty 🔗 |
-
|
Deep Learning with Physics Priors as Generalized Regularizers
(
Poster
)
>
link
Regularization is a key technique to avoid overfitting and to improve generalization of deep learning models. In many scientific and engineering applications, an approximate model of the complex system is usually known, although with both aleatoric and epistemic uncertainties. We present a principled method to incorporate these approximate models as physics priors in model training, by structuring the priors as generalized regularizers. The experimental results demonstrate that our method achieves one to two orders of magnitude of improvement in testing accuracy. |
Frank Liu · Agniva Chowdhury 🔗 |
-
|
SNIP: Bridging Mathematical Symbolic and Numeric Realms with Unified Pre-training
(
Poster
)
>
link
In scientific inquiry, symbolic mathematical equations play a fundamental role in modeling complex natural phenomena. Leveraging the power of deep learning, we introduce SNIP, a Multi-Modal Symbolic-Numeric Pre-training framework. By employing joint contrastive learning between symbolic and numeric domains, SNIP enhances their mutual alignment in pre-trained embeddings. Latent space analysis reveals that symbolic supervision significantly enriches the embeddings of numeric data, and vice versa. Evaluations across diverse tasks, including symbolic-to-numeric and numeric-to-symbolic property prediction, demonstrate SNIP's superior performance over fully supervised baselines. This advantage is particularly pronounced in few-shot learning scenarios, making SNIP a valuable asset in situations with limited available data. |
Kazem Meidani · Parshin Shojaee · Chandan Reddy · Amir Barati Farimani 🔗 |
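The joint contrastive objective can be sketched as a symmetric InfoNCE loss over a batch of paired (symbolic, numeric) embeddings; the random vectors below stand in for the two encoders' outputs, and the temperature is an illustrative choice:

```python
import numpy as np

def info_nce(sym_emb, num_emb, tau=0.1):
    """Symmetric InfoNCE over paired embeddings: matched (symbolic, numeric)
    pairs on the diagonal of the similarity matrix are pulled together,
    mismatched pairs are pushed apart."""
    s = sym_emb / np.linalg.norm(sym_emb, axis=1, keepdims=True)
    n = num_emb / np.linalg.norm(num_emb, axis=1, keepdims=True)
    logits = s @ n.T / tau                  # cosine similarities / temperature
    labels = np.arange(len(s))

    def xent(l):                            # cross-entropy with diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))  # well-aligned pairs
random_pairs = info_nce(z, rng.normal(size=(8, 16)))        # unrelated pairs
```

A low loss means the two modalities share an embedding space, which is what enables the symbolic-to-numeric and numeric-to-symbolic transfer tasks in the abstract.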
-
|
Interpretable Neural PDE Solvers using Symbolic Frameworks
(
Poster
)
>
link
Partial differential equations (PDEs) are ubiquitous in the world around us, modelling phenomena from heat and sound to quantum systems. Recent advances in deep learning have resulted in the development of powerful neural solvers; however, while these methods have demonstrated state-of-the-art performance in both accuracy and computational efficiency, a significant challenge remains in their interpretability. Most existing methodologies prioritize predictive accuracy over clarity in the underlying mechanisms driving the model's decisions. Interpretability is crucial for trustworthiness and broader applicability, especially in scientific and engineering domains where neural PDE solvers might see the most impact. In this context, a notable gap in current research is the integration of symbolic frameworks (such as symbolic regression) into these solvers. Symbolic frameworks have the potential to distill complex neural operations into human-readable mathematical expressions, bridging the divide between black-box predictions and solutions. |
Yolanne Lee 🔗 |
-
|
Infusing Spatial Knowledge into Deep Learning for Earth Science: A Hydrological Application
(
Poster
)
>
link
The integration of Artificial Intelligence (AI) into Earth science, including areas such as geology, ecology, and hydrology, brings potential for significant advancements. Despite this potential, applying deep learning techniques to spatial data in this field is often hindered by the lack of domain knowledge. This paper studies the integration of spatial domain knowledge and deep learning for Earth science. The problem is challenging due to the sparse and noisy input labels, spatial uncertainty, and high computational costs associated with a large number of sample locations. Existing works on neuro-symbolic models focus on integrating symbolic logic into neural networks (e.g., loss function, model architecture, and training label augmentation), but these methods do not fully address the specific spatial data challenges. To bridge this gap, we propose a Spatial Knowledge-Infused Hierarchical Learning (SKI-HL) framework, which iteratively infers labels within a multi-resolution hierarchy, and trains the deep learning model with uncertainty-aware multi-instance learning. The evaluation of real-world hydrological datasets demonstrates the enhanced performance of the SKI-HL framework over several baseline methods. |
Zelin Xu · Tingsong Xiao · Wenchong He · Yu Wang · Zhe Jiang 🔗 |
-
|
Protein Language Model-Powered 3-Dimensional Ligand Binding Site Prediction from Protein Sequence
(
Poster
)
>
link
Prediction of ligand binding sites of proteins is a fundamental and important task for understanding the function of proteins and screening potential drugs. Most existing methods require experimentally determined protein holo-structures as input. However, such structures can be unavailable for novel or less-studied proteins. To tackle this limitation, we propose LaMPSite, which only takes protein sequences and ligand molecular graphs as input for ligand binding site predictions. The protein sequences are used to retrieve residue-level embeddings and contact maps from the pre-trained ESM-2 protein language model. The ligand molecular graphs are fed into a graph neural network to compute atom-level embeddings. Then we compute and update the protein-ligand interaction embedding based on the protein residue-level embeddings and ligand atom-level embeddings, and the geometric constraints in the inferred protein contact map and ligand distance map. A final pooling on the protein-ligand interaction embedding indicates which residues belong to the binding sites. Without any 3D coordinate information of proteins, our proposed model achieves competitive performance compared to baseline methods that require 3D protein structures when predicting binding sites. Given that less than 50% of proteins have reliable structure information in the current stage, LaMPSite will provide new opportunities for drug discovery. |
Shuo Zhang · Lei Xie 🔗 |
-
|
Multiple Physics Pretraining for Physical Surrogate Models
(
Oral
)
>
link
We introduce multiple physics pretraining (MPP), an autoregressive task-agnostic pretraining approach for physical surrogate modeling. MPP involves training large surrogate models to predict the dynamics of multiple heterogeneous physical systems simultaneously by learning features that are broadly useful across diverse physical tasks. In order to learn effectively in this setting, we introduce a shared embedding and normalization strategy that projects the fields of multiple systems into a single shared embedding space. We validate the efficacy of our approach on both pretraining and downstream tasks. In pretraining, we show that a single MPP-pretrained model is able to match or outperform task-specific baselines on all training sub-tasks without the need for finetuning. For downstream tasks, we explore how the benefits of MPP scale with available finetuning data and demonstrate pretraining gains even across large physics gaps. We open-source our code and model weights trained at multiple scales for reproducibility and community experimentation. |
John McCabe · Bruno Régaldo-Saint Blancard · Liam Parker · Ruben Ohana · Miles Cranmer · Alberto Bietti · Michael Eickenberg · Siavash Golkar · Geraud Krawezik · Francois Lanusse · Mariel Pettee · Tiberiu Tesileanu · Kyunghyun Cho · Shirley Ho
|
-
|
Transition Path Sampling with Boltzmann Generator-based MCMC Moves
(
Poster
)
>
link
Sampling all possible transition paths between two 3D states of a molecular system has various applications ranging from catalyst design to drug discovery. Current approaches to sample transition paths use Markov chain Monte Carlo and rely on time-intensive molecular dynamics simulations to find new paths. Our approach operates in the latent space of a normalizing flow that maps from the molecule's Boltzmann distribution to a Gaussian, where we propose new paths without requiring molecular simulations. Using alanine dipeptide, we explore Metropolis-Hastings acceptance criteria in the latent space for exact sampling and investigate different latent proposal mechanisms. |
Michael Plainer · Hannes Stärk · Charlotte Bunne · Stephan Günnemann 🔗 |
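The latent-space proposal idea can be sketched in one dimension: make symmetric random-walk proposals in the latent space of a (here, hand-fixed affine) flow and accept with a Metropolis-Hastings test on the mapped samples. The double-well target stands in for a molecular Boltzmann distribution, and the affine map for a trained normalizing flow:

```python
import numpy as np

# Toy 1-D "Boltzmann" target: a double-well density standing in for the
# molecule's configurational distribution.
def log_target(x):
    return -((x**2 - 1.0) ** 2) / 0.2

# Hand-fixed affine map z -> x as a stand-in for a trained flow; for an
# affine map the log|det Jacobian| is constant and cancels in the MH ratio.
flow = lambda z: 1.5 * z

def latent_mh(n_steps=20000, step=0.5, seed=0):
    """Metropolis-Hastings with symmetric Gaussian proposals made in latent
    space; acceptance is evaluated on the mapped samples, so the chain
    samples the target exactly even if the flow is imperfect."""
    rng = np.random.default_rng(seed)
    z, out = 0.1, []
    for _ in range(n_steps):
        z_new = z + step * rng.normal()
        if np.log(rng.random()) < log_target(flow(z_new)) - log_target(flow(z)):
            z = z_new
        out.append(flow(z))
    return np.asarray(out)

xs = latent_mh()
```

Because proposals are cheap draws in latent space, no molecular-dynamics integration is needed to generate candidate moves, which is the speedup the abstract exploits; here the exactness shows up as the chain visiting both wells.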
-
|
scCLIP: Multi-modal Single-cell Contrastive Learning Integration Pre-training
(
Poster
)
>
link
Recent advances in multi-modal single-cell sequencing technologies enable the simultaneous profiling of chromatin accessibility and transcriptome in individual cells. Integration analysis of multi-modal single-cell data offers a more comprehensive understanding of the regulatory mechanisms linking chromatin status and gene expression, driving cellular processes and diseases. In order to acquire features that align peaks and genes within the same embedding space and facilitate seamless zero-shot transfer to new data, we introduced $\texttt{scCLIP}$ (single-cell Contrastive Learning Integration Pretraining), a generalized multi-modal transformer model with contrastive learning. We show that this model outperforms other competing methods, and beyond this, $\texttt{scCLIP}$ learns transferable features across modalities and generalizes to unseen datasets, which poses great potential for bridging the vast number of unpaired unimodal datasets, both existing and generated in the future. Specifically, we propose the first large-scale transformer model designed for single-cell ATAC-seq data by patching peaks across the genomes and representing each patch as a token. This innovative approach enables us to effectively address the scalability challenges posed by scATAC-seq, even when dealing with datasets of up to $\textit{one million}$ dimensions. Codes are provided at: https://anonymous.4open.science/r/scCLIP-61F6/.
|
Lei Xiong · Tianlong Chen · Manolis Kellis 🔗 |
-
|
xVal: A Continuous Number Encoding for Large Language Models
(
Poster
)
>
link
Large Language Models (LLMs) have not yet been broadly adapted for the analysis of scientific datasets due in part to the unique difficulties of tokenizing numbers. We propose xVal, a numerical encoding scheme that represents any real number using just a single token. xVal represents a given real number by scaling a dedicated embedding vector by the number value. Combined with a modified number-inference approach, this strategy renders the model end-to-end continuous when considered as a map from the numbers of the input string to those of the output string. This leads to an inductive bias that is generally more suitable for applications in scientific domains. We empirically evaluate our proposal on a number of synthetic and real-world datasets. Compared with existing number encoding schemes, we find that xVal is more token-efficient and demonstrates improved generalization. |
Siavash Golkar · Mariel Pettee · Michael Eickenberg · Alberto Bietti · Miles Cranmer · Geraud Krawezik · Francois Lanusse · John McCabe · Ruben Ohana · Liam Parker · Bruno Régaldo-Saint Blancard · Tiberiu Tesileanu · Kyunghyun Cho · Shirley Ho
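The xVal encoding itself is simple enough to sketch directly: every number in the input shares a single dedicated embedding vector that is multiplied by the number's value, making the value-to-embedding map continuous. The vocabulary and dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
# Toy vocabulary: one shared [NUM] embedding plus ordinary word embeddings.
vocab = {"[NUM]": rng.normal(size=d_model), "the": rng.normal(size=d_model)}

def embed_token(tok, value=None):
    """xVal-style embedding: a number is embedded as value * E[NUM], so the
    map value -> embedding is continuous and uses a single token; ordinary
    tokens are looked up as usual."""
    if value is not None:
        return value * vocab["[NUM]"]
    return vocab[tok]

e1 = embed_token("[NUM]", value=1.5)
e2 = embed_token("[NUM]", value=3.0)   # collinear with e1, scaled by 2
```

Doubling the value exactly doubles the embedding, which is the inductive bias (end-to-end continuity in the numbers) that the abstract credits for improved generalization; a paired number-inference head, omitted here, decodes values on the output side.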
|
-
|
XLuminA: An Auto-differentiating Discovery Framework for Super-Resolution Microscopy
(
Oral
)
>
link
In this work we introduce XLuminA, an original computational framework designed for the discovery of novel optical hardware in super-resolution microscopy. Our framework offers auto-differentiation capabilities, allowing for the fast and efficient simulation and automated design of entirely new optical setups from scratch. We showcase its potential by rediscovering three foundational experiments, each one covering different areas in optics: an optical telescope, STED microscopy and the focusing beyond the diffraction limit of a radially polarized light beam. Intriguingly, for this last experiment, the machine found an alternative solution following the same physical principle exploited for breaking the diffraction limit. With XLuminA, we can go beyond simple optimization and calibration of known experimental setups, opening the door to potentially uncovering new microscopy concepts within the vast landscape of experimental possibilities. |
Carla Rodríguez · Sören Arlt · Leonhard Möckl · Mario Krenn 🔗 |
-
|
Latent Diffusion Model for DNA Sequence Generation
(
Poster
)
>
link
The harnessing of machine learning, especially deep generative models, has opened up promising avenues in the field of synthetic DNA sequence generation. Whilst Generative Adversarial Networks (GANs) have gained traction for this application, they often face issues such as limited sample diversity and mode collapse. On the other hand, Diffusion Models are a promising new class of generative models that are not burdened with these problems, enabling them to reach the state-of-the-art in domains such as image generation. In light of this, we propose a novel latent diffusion model, DiscDiff, tailored for discrete DNA sequence generation. By simply embedding discrete DNA sequences into a continuous latent space using an autoencoder, we are able to leverage the powerful generative abilities of continuous diffusion models for the generation of discrete data. Additionally, we introduce Fréchet Reconstruction Distance (FReD) as a new metric to measure the sample quality of DNA sequence generations. Our DiscDiff model demonstrates an ability to generate synthetic DNA sequences that align closely with real DNA in terms of Motif Distribution, Latent Embedding Distribution (FReD), and Chromatin Profiles. Additionally, we contribute a comprehensive cross-species dataset of 150K unique promoter-gene sequences from 15 species, enriching resources for future generative modelling in genomics. We will make our code public upon publication. |
Zehui Li · Yuhao Ni · Tim Huygelen · Akashaditya Das · Guoxuan Xia · Guy-Bart Stan · Yiren Zhao 🔗 |
-
|
Exploring the applications of Neural Cellular Automata in molecular sciences
(
Poster
)
>
link
In recent years, Cellular Automata have been merged with developments in deep learning to replace the traditional update rules with a neural network. These Neural Cellular Automata (NCAs) have been applied to 2D and 3D object generation, morphogenesis, as well as the orchestration of goal-directed behavioural responses. While there have been numerous examples of applying NCAs to emoji-like and common gameplay objects (like houses or trees in Minecraft), their adaptation to molecule representations has yet to be explored. In this work, we present an adaptation of 3D NCAs to voxelized representations of small molecules and biomolecules. We present three exemplary applications of NCAs to design small-molecule interactors, reconstruct missing parts of protein backbones, and model physical transformations. |
Sebastian Pagel · Lee Cronin 🔗 |
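The core NCA loop, local perception followed by a small per-cell update, can be sketched in 2D with a fixed neighbourhood-sum filter and a single linear unit standing in for the learned kernels and MLP (all weights here are illustrative, not trained):

```python
import numpy as np

def perceive(grid):
    """Fixed 3x3 neighbourhood-sum 'perception' computed with np.roll
    (toroidal boundary); a real NCA learns its perception kernels."""
    out = np.zeros_like(grid)
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out += np.roll(np.roll(grid, dx, axis=0), dy, axis=1)
    return out

def nca_step(grid, w=0.1, b=0.0):
    """One local update per cell: a single linear unit + ReLU stands in for
    the small shared MLP of a Neural Cellular Automaton; states are clipped
    to [0, 1] like voxel occupancies."""
    update = np.maximum(0.0, w * perceive(grid) + b)
    return np.clip(grid + update, 0.0, 1.0)

grid = np.zeros((16, 16))
grid[8, 8] = 1.0                 # seed cell
for _ in range(5):
    grid = nca_step(grid)        # pattern grows outward from the seed
```

Growing a voxelized molecule from a seed follows the same pattern in 3D, with multiple channels and a learned update network in place of the single unit.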
-
|
Koopman-Assisted Reinforcement Learning
(
Oral
)
>
link
The Bellman equation and its continuous form, the Hamilton-Jacobi-Bellman (HJB) equation, are ubiquitous in reinforcement learning and control theory contexts due, in part, to their guaranteed convergence towards a system’s optimal value function. However, their practical application faces severe limitations. This paper explores the connection between the data-driven Koopman operator and Bellman Markov Decision Processes, resulting in the development of two new reinforcement learning algorithms to alleviate these limitations. In particular, we focus on Koopman operator methods that reformulate a nonlinear system by lifting into a new coordinate system where the dynamics become linear, and where HJB-based methods are more tractable. These transformations enable the estimation, prediction, and control of strongly nonlinear dynamics. Viewing the Bellman equation as a controlled dynamical system, the Koopman operator is able to describe the expectation of the time evolution of the value function in the given systems via linear dynamics in the lifted coordinates. By parameterizing the Koopman operator with control actions and making an assumption about the feature space of the time evolution of the value function, we are able to construct a new “Koopman tensor” that facilitates the estimation of the optimal value function. Finally, a transformation of Bellman’s framework in terms of the Koopman tensor enables us to reformulate two max-entropy reinforcement learning algorithms: soft-value iteration and soft actor-critic (SAC). This framework is very flexible and can be used for deterministic or stochastic systems as well as for discrete or continuous-time dynamics.
We show that these algorithms attain SOTA with respect to traditional neural network-based SAC and linear quadratic regulator baselines while retaining interpretability on 3 controlled dynamical systems: the Lorenz system, the fluid flow past a cylinder, and a double-well potential with non-isotropic stochastic forcing. |
Preston Rozwood · Edward Mehrez · Ludger Paehler · Wen Sun · Steven Brunton 🔗 |
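The lifting idea can be sketched with a classic finite-dictionary example, fitting a linear Koopman approximation by least squares (extended dynamic mode decomposition). The system and dictionary are textbook illustrations; the paper's control-parameterized Koopman tensor is omitted:

```python
import numpy as np

# Nonlinear discrete-time system with a known finite Koopman embedding:
# x1' = a*x1 ;  x2' = b*x2 + c*x1^2
a, b, c = 0.9, 0.5, 0.2
def step(x):
    return np.array([a * x[0], b * x[1] + c * x[0] ** 2])

def lift(x):
    """Dictionary of observables in which the dynamics become exactly linear:
    [x1, x2, x1^2] evolves as phi' = phi @ K."""
    return np.array([x[0], x[1], x[0] ** 2])

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
Phi = np.array([lift(x) for x in X])            # lifted states
PhiN = np.array([lift(step(x)) for x in X])     # lifted next states
# Least-squares fit of the finite-dimensional Koopman approximation K.
K, *_ = np.linalg.lstsq(Phi, PhiN, rcond=None)
```

Because the dictionary is closed under the dynamics here, the fit is exact (`K[0,0] = a`, `K[2,2] = a**2`), and value-function expectations can be propagated through the linear operator, which is the tractability the abstract exploits.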
-
|
Automated distillation of genomic equations governing single cell gene expression
(
Poster
)
>
link
Gene expression is an essential cellular process that is controlled by a complex and orchestrated regulatory network of transcription factors and epigenetic modifications. The advancement in single-cell RNA sequencing enables the investigation of gene expression control at an unprecedented fine resolution and large scale. Yet, understanding the sequence determinants underlying distinct primary cell types remains elusive and challenging. While deep neural networks have shown strong performance in predicting gene expression, the lack of meaningful explanations of predictions, especially in systematic understanding of the molecular mechanisms, motivates the search for more transparent models. We present an automated model that predicts gene expression from genetic sequences while providing both strong performance and direct interpretations of predictions. Our model combines a pre-trained genetic sequence class model and neural architecture search with symbolic regression to distill explainable genomic equations. We applied our method to an in-house human pituitary (a specialized gland in the brain that controls the endocrine system) single-cell gene expression data. The distilled genomic equation prediction accuracy (Pearson r=0.713) is comparable to other explainable models, without artificially introducing strong inductive bias that may not hold for the complex and potentially non-linear cellular system. The genomic equations shed light on how sequence classes interact and regulate the cell type-specific, finely-controlled transcriptomic program in the human endocrine system. To our knowledge, this is the first attempt at distilling genomic equations from neural networks using symbolic regression. |
Edouardo Honig · Frederique Ruf Zamojski · Stuart Sealfon · Ying Nian Wu · Zijun Frank Zhang 🔗 |
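The core idea, distilling an explicit equation from a trained black-box predictor, can be illustrated with a toy stand-in for symbolic regression: a sparse linear fit over candidate symbolic terms. The teacher function, basis set, and pruning threshold below are illustrative assumptions, not the paper's sequence-class model.

```python
import numpy as np

def distill_equation(teacher, X, basis):
    """Fit a sparse linear combination of symbolic basis terms to the
    teacher's outputs: a toy stand-in for symbolic regression."""
    y = teacher(X)                                    # query the black box
    Phi = np.column_stack([f(X) for _, f in basis])
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    coef[np.abs(coef) < 1e-3] = 0.0                   # prune negligible terms
    eq = " ".join(f"{c:+.2f}*{name}" for (name, _), c in zip(basis, coef) if c != 0)
    return coef, eq

# Hypothetical teacher whose true law is y = 2*x1 + 0.5*x1*x2.
teacher = lambda X: 2.0 * X[:, 0] + 0.5 * X[:, 0] * X[:, 1]
X = np.random.default_rng(0).normal(size=(200, 2))
basis = [("x1", lambda X: X[:, 0]),
         ("x2", lambda X: X[:, 1]),
         ("x1*x2", lambda X: X[:, 0] * X[:, 1])]
coef, eq = distill_equation(teacher, X, basis)
```

Full symbolic regression additionally searches over the basis itself; here the candidate terms are fixed for brevity.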
-
|
RetroBridge: Modeling Retrosynthesis with Markov Bridges
(
Poster
)
>
link
Retrosynthesis planning is a fundamental challenge in chemistry which aims at designing multi-step reaction pathways from commercially available starting materials to a target molecule. Each step in multi-step retrosynthesis planning requires accurate prediction of possible precursor molecules given the target molecule and confidence estimates to guide heuristic search algorithms. We model single-step retrosynthesis as a distribution learning problem in a discrete state space. First, we introduce the Markov Bridge Model, a generative framework aimed to approximate the dependency between two intractable discrete distributions accessible via a finite sample of coupled data points. Our framework is based on the concept of a Markov bridge, a Markov process pinned at its endpoints. Unlike diffusion-based methods, our Markov Bridge Model does not need a tractable noise distribution as a sampling proxy and directly operates on the input product molecules as samples from the intractable prior distribution. We then address the retrosynthesis planning problem with our novel framework and introduce RetroBridge, a template-free retrosynthesis modeling approach that achieves state-of-the-art results on standard evaluation benchmarks. |
Ilia Igashov · Arne Schneuing · Marwin Segler · Michael Bronstein · Bruno Correia 🔗 |
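The Markov bridge construction the abstract builds on, a Markov process pinned at both endpoints, can be sketched for a generic discrete-state chain. The transition matrix and endpoints below are illustrative; RetroBridge learns the transition probabilities rather than deriving them from a fixed chain.

```python
import numpy as np

def sample_markov_bridge(P, x0, xT, T, rng):
    """Sample a discrete-state Markov chain pinned at both endpoints:
    p(x_{t+1}=j | x_t=i, x_T) is proportional to P[i, j] * (P^(T-t-1))[j, xT]."""
    powers = [np.linalg.matrix_power(P, k) for k in range(T + 1)]
    path = [x0]
    for t in range(T - 1):
        probs = P[path[-1]] * powers[T - t - 1][:, xT]
        probs = probs / probs.sum()
        path.append(int(rng.choice(len(P), p=probs)))
    path.append(xT)
    return path

# Toy 3-state chain with full support, pinned to start at 0 and end at 2.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
path = sample_markov_bridge(P, x0=0, xT=2, T=6, rng=np.random.default_rng(0))
```

Every sampled path respects the chain's dynamics yet is guaranteed to hit both prescribed endpoints, which is what lets the model operate directly on product molecules without a noise proxy.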
-
|
Adaptive learning acceleration for nonlinear PDE solvers
(
Poster
)
>
link
We propose a novel type of nonlinear solver acceleration for systems of nonlinear partial differential equations (PDEs) based on online/adaptive learning, applied in the context of multiphase porous media flow. The presented method is built on four pillars: compaction of the training space using dimensionless numbers, offline training in a representative simplistic (two-dimensional) numerical model, control of the numerical relaxation (or another tuning parameter) of a classical nonlinear solver, and online learning to improve the machine learning model at run time (online training). The approach reduces the number of nonlinear iterations by dynamically adjusting a single global parameter (the relaxation factor) and by learning on-the-job the characteristics of each numerical model. Its implementation is simple and general. In this work, we also identify the key dimensionless parameters required, compare the performance of different machine learning models, show the reduction in the number of nonlinear iterations obtained by using the proposed approach in complex realistic (three-dimensional) models, and, for the first time, properly couple a machine learning model into an open-source multiphase flow simulator, achieving up to an 85% reduction in computational time. |
Vinicius L S Silva · Pablo Salinas · Claire E Heaney · Matthew Jackson · Christopher C Pain 🔗 |
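The control idea, a single global relaxation factor adjusted during the run, can be sketched on a scalar Newton iteration. The adaptation rule below is a hand-written heuristic standing in for the paper's learned controller.

```python
def relaxed_newton(f, df, x0, omega=1.0, tol=1e-10, max_it=100):
    """Newton's method with a single global relaxation factor omega,
    adapted online: shrink it when the residual grows, grow it back
    toward 1 when the iteration makes progress."""
    x, r_prev = x0, abs(f(x0))
    for it in range(max_it):
        x_new = x - omega * f(x) / df(x)
        r = abs(f(x_new))
        omega = max(0.1, 0.5 * omega) if r > r_prev else min(1.0, 1.2 * omega)
        x, r_prev = x_new, r
        if r < tol:
            return x, it + 1
    return x, max_it

# Solve x^3 = 2 starting from x = 1.5.
root, iters = relaxed_newton(lambda x: x**3 - 2.0, lambda x: 3.0 * x**2, 1.5)
```

In the paper, the rule for updating the relaxation factor is what the machine learning model learns, per numerical model and on-the-job.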
-
|
Unsupervised Representation Learning of Brain Activity via Bridging Voxel Activity and Functional Connectivity
(
Poster
)
>
link
Effective brain representation learning is a key step toward understanding cognitive processes and unlocking the detection of, and potential therapeutic interventions for, neurological diseases and disorders. Existing studies have focused on either (1) voxel-level activity, where only a single beta weight per voxel (i.e., an aggregation of voxel activity over a time window) is considered, missing temporal dynamics, or (2) functional connectivity of the brain at the level of regions of interest, missing voxel-level activities. In this paper, we bridge this gap and design BrainMixer, an unsupervised learning framework that effectively utilizes both functional connectivity and the associated time series of voxels to learn voxel-level representations. BrainMixer employs two simple yet effective MLP-based encoders to simultaneously learn the dynamics of voxel-level signals and their functional correlations. To encode voxel activity, BrainMixer fuses information across both time and voxel dimensions via a dynamic self-attention mechanism. To learn the structure of the functional connectivity graph, BrainMixer uses temporal graph patching and encodes each patch by combining its nodes' features via a new adaptive temporal pooling. Our experiments show that BrainMixer attains outstanding performance and outperforms 13 baselines across different downstream tasks and experimental setups. |
Ali Behrouz · Parsa Delavari · Farnoosh Hashemi 🔗 |
-
|
Representation Learning for Spatial Multimodal Data Integration with Optimal Transport
(
Poster
)
>
link
Spatial sequencing technologies have advanced rapidly in the past few years, and recently multiple modalities of cells -- including mRNA expression, chromatin state, and other molecular modalities -- can be measured with corresponding spatial locations in tissue slices. To facilitate scientific discoveries from spatial multi-omics sequencing experiments, methods for integrating multimodal spatial data are critically needed. Here we define the problem of spatial multimodal integration as integrating multiple modalities from related tissue slices into a Common Coordinate Framework (CCF) and learning biologically meaningful representations for each spatial location in the CCF. We introduce a novel machine learning framework combining optimal transport and variational autoencoders to solve the spatial multimodal integration problem. Our method outperforms existing single-cell multi-omics integration methods that ignore spatial information, and allows researchers to analyze tissues comprehensively by integrating knowledge from spatial slices of multiple modalities. |
Xinhao Liu · Benjamin Raphael 🔗 |
-
|
ALAS: Active Learning for Autoconversion Rates Prediction from Satellite Data
(
Poster
)
>
link
High-resolution simulations, such as the ICOsahedral Non-hydrostatic Large-Eddy Model (ICON-LEM), provide valuable insights into the complex interactions among aerosols, clouds, and precipitation, which are the major contributors to climate change uncertainty. However, due to their exorbitant computational costs, such simulations can only be employed for a limited period and geographical area. To address this, we propose a more cost-effective, machine-learning-powered method that leverages the high-resolution climate simulation as an oracle together with abundant unlabeled satellite data to better understand the intricate dynamics of the climate system. Our approach uses active learning to predict autoconversion rates, a crucial step in precipitation formation, while significantly reducing the need for labeled instances. In this study, we present novel methods, WiFi and MeFi, custom query-strategy fusions for labeling instances, along with active feature selection based on SHAP; these are designed for simplicity and practicality in tackling real-world challenges, with a specific focus on the prediction of autoconversion rates. |
Maria Carolina Novitasari · Johannes Quaas · Miguel Rodrigues 🔗 |
-
|
Large Language Models in Molecular Discovery
(
Poster
)
>
link
The success of language models, especially transformers in natural language processing, has trickled into scientific domains, giving rise to the concept of "scientific language models" that operate on small molecules, proteins or polymers. In chemistry, language models contribute to accelerating the molecule discovery cycle, as evidenced by promising recent findings in early-stage drug discovery. In this perspective, we review the role of language models in molecular discovery, underlining their strengths and examining their weaknesses in de novo drug design, property prediction and reaction chemistry. We highlight valuable open-source software assets to lower the entry barrier to the field of scientific language modeling. Furthermore, as a solution to some of the weaknesses we identify, we outline a vision for future molecular design that integrates a chat-bot interface with available computational chemistry tools. Our contribution serves as a valuable resource for researchers, chemists, and AI enthusiasts interested in understanding how language models can and will be used to accelerate chemical discovery. |
Nikita Janakarajan · Tim Erdmann · Sarathkrishna Swaminathan · Teodoro Laino · Jannis Born 🔗 |
-
|
Virtual Receptors for Efficient Molecular Diffusion
(
Poster
)
>
link
Machine learning approaches to Structure-Based Drug Design (SBDD) have proven quite fertile over the last few years. In particular, diffusion-based approaches to SBDD have shown great promise. We present a technique which expands on this diffusion approach in two crucial ways. First, we address the size disparity between the drug molecule and the target/receptor, which makes learning more challenging and inference slower. We do so through the notion of a Virtual Receptor, which is a compressed version of the receptor; it is learned so as to preserve key aspects of the structural information of the original receptor, while respecting the relevant group equivariance. Second, we incorporate a protein language embedding used originally in the context of protein folding. We experimentally demonstrate the contributions of both the virtual receptors and the protein embeddings: in practice, they lead to both better performance, as well as significantly faster computations. |
Matan Halfon · Eyal Rozenberg · Ehud Rivlin · Daniel Freedman 🔗 |
-
|
Insight Miner: A Large-scale Multimodal Model for Insight Mining from Time Series
(
Poster
)
>
link
Time-series data is essential in various science and industry domains, such as environmental analysis, agriculture, transportation, and finance. Researchers need to use their domain knowledge to mine insights from time-series data to study scientific topics. However, this process is time-consuming and highly dependent on expert knowledge. This paper proposes a large-scale multimodal model (LMM), Insight Miner, to generate high-quality, comprehensive time-series descriptions with domain-specific knowledge. To introduce rich time-series insights to Insight Miner, we propose a time-series analysis dataset, TS-Insights, composed of time series and textual insight pairs. The TS-Insights dataset includes 100k time-series windows sampled from 20 forecasting datasets spanning a wide variety of domains and granularities. Through a meticulous combination of heuristics and statistical tools, we preprocess each raw time-series window and use GPT-4 to generate a coherent trend description based on the extracted features. After training on the TS-Insights dataset via instruction tuning, the Insight Miner model generates better time-series descriptions and insights than state-of-the-art multimodal models such as LLaVA (Liu et al., 2023) and GPT-4. Our findings suggest a promising direction for leveraging LMMs for time-series analysis, potentially offering avenues for efficient insight mining in scientific domains. The TS-Insights dataset is available and will be published upon acceptance. |
Yunkai Zhang · Yawen Zhang · Ming Zheng · Kezhen Chen · Chongyang Gao · Ruian Ge · Siyuan Teng · Amine Jelloul · Jinmeng Rao · Xiaoyuan Guo · Chiang-Wei Fang · Zeyu Zheng · Jie Yang |
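The preprocessing step, extracting trend statistics from a raw window before prompting an LLM for a description, might look like the following sketch. The specific features and their names are our own assumptions, not the TS-Insights recipe.

```python
import numpy as np

def trend_features(window):
    """Fit a line to a time-series window and summarize its trend,
    a heuristic preprocessing step ahead of text generation."""
    t = np.arange(len(window), dtype=float)
    slope, intercept = np.polyfit(t, window, 1)
    resid = window - (slope * t + intercept)
    return {"slope": float(slope),
            "direction": "increasing" if slope > 0 else "decreasing",
            "volatility": float(np.std(resid))}

feats = trend_features(np.array([1.0, 2.1, 2.9, 4.2, 5.0]))
```

A dictionary like `feats` would then be serialized into the prompt from which GPT-4 writes the coherent trend description.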
-
|
STRIDE: Structure-guided Generation for Inverse Design of Molecules
(
Poster
)
>
link
Machine learning, and especially deep learning, has had an increasing impact on molecule and materials design. In particular, growing access to an abundance of high-quality small-molecule data for generative modeling in drug design has led to promising results for drug discovery. However, for many important classes of materials such as catalysts, antioxidants, and metal-organic frameworks, such large datasets are not available. Such families of molecules, with limited samples and structural similarities, are especially prevalent in industrial applications. As is well known, retraining and even fine-tuning are challenging on such small datasets. Novel, practically applicable molecules are most often derivatives of well-known molecules, suggesting approaches to addressing data scarcity. To address this problem, we introduce STRIDE, a generative molecule workflow that generates novel molecules with an unconditional generative model guided by known molecules, without any retraining. We generate molecules outside of the training data from a highly specialized set of antioxidant molecules. Via guiding, our generated molecules have on average 21.7% lower synthetic accessibility scores and a 5.9% lower ionization potential. |
Shehtab Zaman · Denis Akhiyarov · Mauricio Araya-Polo · Kenneth Chiu 🔗 |
-
|
Stoichiometry Representation Learning with Polymorphic Crystal Structures
(
Poster
)
>
link
Despite the recent success of machine learning (ML) in materials science, this success relies heavily on the structural description of crystals, which is itself computationally demanding and occasionally unattainable. Stoichiometry descriptors can be an alternative approach, revealing the ratio between the elements that form a given compound without any structural information. However, learning representations of stoichiometry is not trivial due to a phenomenon in materials science called polymorphism: a single stoichiometry can exist in multiple structural forms due to the flexibility of atomic arrangements, inducing uncertainties in the representation. To this end, we propose PolySRL, which learns probabilistic representations of stoichiometry by utilizing readily available structural information, where the uncertainty reveals the polymorphic structures of the stoichiometry. Extensive experiments on sixteen datasets demonstrate the superiority of PolySRL, and analysis of the uncertainties sheds light on the applicability of PolySRL in real-world materials discovery. |
Namkyeong Lee · Heewoong Noh · Gyoung S. Na · Tianfan Fu · Jimeng Sun · Chanyoung Park 🔗 |
-
|
Exciton-Polariton Condensates: A Fourier Neural Operator Approach
(
Poster
)
>
link
Advancements in semiconductor fabrication over the past decade have catalyzed extensive research into all-optical devices driven by exciton-polariton condensates. Preliminary validations of such devices, including transistors, have shown encouraging results even under ambient conditions. However, a significant challenge remains for large-scale application: the lack of a robust solver that can simulate complex nonlinear systems which require an extended period of time to stabilize. Addressing this need, we propose applying a machine-learning-based Fourier Neural Operator to find the solution of the Gross-Pitaevskii equations coupled with extra exciton rate equations. This work marks the first direct application of Neural Operators to an exciton-polariton condensate system. Our findings show that the proposed method can predict final-state solutions to a high degree of accuracy almost 1000 times faster than CUDA-based GPU solvers. Moreover, this paves the way for potential all-optical chip design workflows that integrate experimental data. |
Surya Sathujoda · Yuan Wang · Kanishk Gandhi 🔗 |
-
|
AI, Robot Neuroscientist: Reimagining Hypothesis Generation
(
Poster
)
>
link
Neuroscience has long relied on human-conceived hypotheses, yet the brain's complexity fundamentally challenges this epistemology. Modern technologies and the large-scale data collection they enable throw this challenge into sharp relief. We champion the potential of AI for neuroscience exploration, highlighting both implicit, 'uninterpretable' models as aids in hypothesis formulation and symbolic regression for explicit hypothesis generation. For researchers from non-neuroscience backgrounds, we discuss domain-specific considerations in integrating AI into neuroscience research. By spotlighting underexplored avenues for AI to accelerate neuroscience, we aim to draw both communities toward these exciting research opportunities. |
Jiaqi Shang · Will Xiao 🔗 |
-
|
Baking Symmetry into GFlowNets
(
Oral
)
>
link
GFlowNets have exhibited promising performance in generating diverse candidates with high rewards. These networks generate objects incrementally and aim to learn a policy that samples objects with probability proportional to their rewards. However, current GFlowNet training pipelines do not consider the presence of isomorphic actions, i.e., actions resulting in symmetric or isomorphic states. This lack of symmetry awareness increases the number of samples required for training GFlowNets and can result in inefficient and potentially incorrect flow functions, decreasing the reward and diversity of the generated objects. In this study, our objective is to integrate symmetries into GFlowNets by identifying equivalent actions during the generation process. Experimental results on synthetic data demonstrate the promising performance of our proposed approaches. |
George Ma · Emmanuel Bengio · Yoshua Bengio · Dinghuai Zhang 🔗 |
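Identifying equivalent actions amounts to grouping actions by a canonical form of the state they produce. A minimal sketch, with tiny edge-set states standing in for GFlowNet objects:

```python
def group_isomorphic_actions(state, actions, apply, canonical):
    """Group actions whose resulting states share a canonical form,
    i.e., are isomorphic; `canonical` plays the role of a graph
    canonicalization routine."""
    groups = {}
    for a in actions:
        groups.setdefault(canonical(apply(state, a)), []).append(a)
    return list(groups.values())

# Toy states: sets of directed edges; an undirected canonical form makes
# adding (0, 1) and adding (1, 0) isomorphic actions.
apply_edge = lambda s, a: s | {a}
canonical = lambda s: frozenset(frozenset(e) for e in s)
groups = group_isomorphic_actions(frozenset(), [(0, 1), (1, 0), (0, 2)],
                                  apply_edge, canonical)
```

Once equivalent actions are grouped, their flows can be tied or merged so that symmetric states are not counted as distinct during training.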
-
|
Shape Arithmetic Expressions
(
Poster
)
>
link
Symbolic regression has excelled in uncovering equations from physics, chemistry, biology, and related disciplines. However, its effectiveness becomes less certain when applied to experimental data lacking inherent closed-form expressions. Empirically derived relationships, such as entire stress-strain curves, may defy concise closed-form representation, compelling us to explore more adaptive modeling approaches that balance flexibility with interpretability. In our pursuit, we turn to Generalized Additive Models (GAMs), a widely used class of models known for their versatility across various domains. Although GAMs can capture non-linear relationships between variables and targets, they cannot capture intricate feature interactions. In this work, we investigate both of these challenges and propose a novel class of models, Shape Arithmetic Expressions (SHAREs), that fuses GAM's flexible shape functions with the complex feature interactions found in mathematical expressions. SHAREs also provide a unifying framework for both of these approaches. We also design a set of rules for constructing SHAREs that guarantee transparency of the found expressions beyond the standard constraints based on the model's size. |
Krzysztof Kacprzyk · Mihaela van der Schaar 🔗 |
-
|
Evaluating Uncertainty Quantification approaches for Neural PDEs in scientific application
(
Poster
)
>
link
The accessibility of spatially distributed data, enabled by affordable sensors and field and numerical experiments, has facilitated the development of data-driven solutions for scientific problems, including climate change, weather prediction, and urban planning. Neural Partial Differential Equations (Neural PDEs), which combine deep learning (DL) techniques with domain expertise (e.g., governing equations) for parameterization, have proven effective at capturing valuable correlations within spatiotemporal datasets. However, sparse and noisy measurements, coupled with modeling approximations, introduce aleatoric and epistemic uncertainties. Quantifying the uncertainties propagated from model inputs to outputs therefore remains a challenge and an essential goal for establishing the trustworthiness of Neural PDEs. This work evaluates various Uncertainty Quantification (UQ) approaches for both forward and inverse problems in scientific applications. Specifically, we investigate the effectiveness of Bayesian methods, such as Hamiltonian Monte Carlo (HMC) and Monte-Carlo Dropout (MCD), and a more conventional approach, Deep Ensembles (DE). To illustrate their performance, we consider two canonical PDEs: Burgers' equation and the Navier-Stokes equation. Our results indicate that Neural PDEs can effectively reconstruct flow systems and predict the associated unknown parameters. However, the Bayesian methods tend to display a higher degree of certainty in their predictions than DE, suggesting that they may underestimate the true underlying uncertainty and thus appear more confident than the DE approach. |
Vardhan Dongre · Gurpreet Singh Hora 🔗 |
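Of the approaches compared, Deep Ensembles are the simplest to sketch: fit several models on bootstrap resamples and read epistemic uncertainty off the spread of their predictions. The polynomial fits below stand in for the neural PDE surrogates.

```python
import numpy as np

def ensemble_predict(x_train, y_train, x_test, m=10, deg=3, seed=0):
    """Fit m models on bootstrap resamples; the mean is the prediction
    and the standard deviation across members is the (epistemic)
    uncertainty estimate."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(m):
        idx = rng.integers(0, len(x_train), len(x_train))  # bootstrap resample
        coef = np.polyfit(x_train[idx], y_train[idx], deg)
        preds.append(np.polyval(coef, x_test))
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.05 * np.random.default_rng(1).normal(size=50)
mean, std = ensemble_predict(x, y, np.array([0.5, 2.0]))
# the spread at 2.0, far outside the training range, should exceed that at 0.5
```

A well-calibrated UQ method should show exactly this behavior: uncertainty growing where the model extrapolates, which is the property the paper probes for HMC, MCD, and DE.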
-
|
Relaxed Octahedral Group Convolution for Learning Symmetry Breaking in 3D Physical Systems
(
Poster
)
>
link
Deep equivariant models use symmetries to improve sample efficiency and generalization. However, the assumption of perfect symmetry in many of these models can sometimes be restrictive, especially when the data does not perfectly align with such symmetries. Thus, we introduce relaxed octahedral group convolution for modeling 3D physical systems in this paper. This flexible convolution technique provably allows the model to both maintain the highest level of equivariance that is consistent with data and discover the subtle symmetry-breaking factors in the physical systems. Empirical results validate that our approach can not only provide insights into the symmetry-breaking factors in phase transitions but also achieves superior performance in fluid super-resolution tasks. |
Rui Wang · Robin Walters · Tess Smidt 🔗 |
-
|
Higher Order Equivariant Graph Neural Networks for Charge Density Prediction
(
Poster
)
>
link
The calculation of electron density distribution in materials and molecules is central to the study of their quantum and macro-scale properties, yet accurate and efficient calculation remains a long-standing challenge in the field of material science. This work introduces ChargE3Net, an E(3)-equivariant graph neural network for predicting electron density in atomic systems. Unlike existing methods, ChargE3Net achieves equivariance through the use of higher-order tensor representations, and directly predicts the charge density at a set of desired locations. We demonstrate the effectiveness of ChargE3Net on large and diverse sets of molecules and materials, where it achieves state-of-the-art performance over existing methods, and scales to larger systems than what is feasible to compute with density functional theory. Through additional experimentation, we demonstrate the effect of introducing higher-order equivariant representations, and why they yield performance improvements in the charge density prediction setting. |
Teddy Koker · Keegan Quigley · Lin Li 🔗 |
-
|
ChatPathway: Conversational Large Language Models for Biology Pathway Detection
(
Poster
)
>
link
Biological pathways, like protein-protein interactions and metabolic networks, are vital for understanding diseases and drug development. Some databases such as KEGG are designed to store and map these pathways. However, many bioinformatics methods face limitations due to database constraints, and certain deep learning models struggle with the complexities of biochemical reactions involving large molecules and diverse enzymes. Importantly, the thorough exploration of biological pathways demands a deep understanding of scientific literature and past research. Despite this, recent advancements in Large Language Models (LLMs), especially ChatGPT, show promise. We first restructured data from KEGG and augmented it with molecule structural and functional information sourced from UniProt and PubChem. Our study evaluated LLMs, particularly GPT-3.5-turbo and Galactica, in predicting biochemical reactions and pathways using our constructed data. We also assessed its ability to predict novel pathways, not covered in its training dataset, using findings from recently published studies. While GPT demonstrated strengths in pathway mapping, Galactica encountered challenges. This research emphasizes the potential of merging LLMs with biology, suggesting a harmonious blend of human expertise and AI in decoding biological systems. |
Yanjing Li · Hannan Xu · Haiteng Zhao · Hongyu Guo · Shengchao Liu 🔗 |
-
|
A Transformer Model for Symbolic Regression towards Scientific Discovery
(
Oral
)
>
link
Symbolic Regression (SR) searches for mathematical expressions which best describe numerical datasets. This circumvents the interpretation issues inherent to artificial neural networks, but SR algorithms are often computationally expensive. This work proposes a new Transformer model aimed at Symbolic Regression, with a particular focus on its application to Scientific Discovery. We propose three encoder architectures with increasing flexibility, at the cost of violating column-permutation equivariance. Training results indicate that the most flexible architecture is required to prevent overfitting. Once trained, we apply our best model to the SRSD datasets (Symbolic Regression for Scientific Discovery datasets), which yields state-of-the-art results using the normalized tree-based edit distance, at no extra computational cost. |
Florian Lalande · Yoshitomo Matsubara · Naoya Chiba · Tatsunori Taniai · Ryo Igarashi · Yoshitaka Ushiku 🔗 |
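The trade-off above hinges on column-permutation equivariance: a set-style encoder over tabular columns should give the same embedding when the input variables are reordered. A property check of this kind might look as follows (both toy encoders are our own illustrations, not the paper's architectures):

```python
import numpy as np

def is_permutation_invariant(encoder, X, perm, atol=1e-8):
    """Check whether an encoder gives the same output when the
    feature columns of X are reordered by `perm`."""
    return bool(np.allclose(encoder(X), encoder(X[:, perm]), atol=atol))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))

# Invariant toy encoder: sum a shared column-wise map over columns.
good = lambda X: np.tanh(X).sum(axis=1)
# Non-invariant encoder: column-specific weights break the symmetry.
bad = lambda X: (X * np.array([1.0, 2.0, 3.0])).sum(axis=1)
```

The paper's most flexible encoder is of the second kind: it gives up this symmetry in exchange for expressiveness, and regularization against overfitting becomes the concern.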
-
|
Using Foundation Models to Promote Digitization and Reproducibility in Scientific Experimentation
(
Poster
)
>
link
Accelerating scientific discovery through AI relies on the availability of high-quality data from scientific experimentation. Yet, scientific experimentation suffers from poor reproducibility and data capture challenges, mostly stemming from the difficulty in transcribing all details of an experiment and the different ways in which individuals document their lab work. With the emergence of foundation models capable of processing multiple data modalities, including vision and language, there is a unique opportunity to redefine data and metadata capture and the corresponding scientific documentation process. In this contribution, we discuss the challenges associated with lab digitization today and how multi-modal learning with transformer-based architectures can contribute to a new research infrastructure for scientific discovery, in order to fully describe experimental methods and outcomes while facilitating data sharing and collaboration. We present a case study on a hybrid digital infrastructure and transformer-based vision-language models to transcribe high-dimensional raw data streams from non-invasive recording devices that represent the interaction of researchers with lab environments during scientific experimentation. The infrastructure is demonstrated in test cases related to semiconductor research and wet chemistry, where we show how vision-language foundation models fine-tuned on a limited set of experiments can be used to generate reports that exhibit high similarity with the recorded procedures. Our findings illustrate the feasibility of using foundation models to automate data capture and digitize all aspects of scientific experimentation, and suggest that the challenge of scarce training data for specific laboratory procedures can be alleviated by leveraging self-supervised pretraining on more abundant data from other domains. |
Amol Thakkar · Andrea Giovannini · Antonio Foncubierta-Rodriguez · Carlo Baldassari · Dimitrios Christofidellis · Federico Zipoli · Gianmarco Gabrieli · Jannis Born · Mara Graziani · Marvin Alberts · Matteo Manica · Michael Stiefel · Oliver Schilter · Teodoro Laino · Patrick Ruch |
-
|
Bad Exoplanet! Explaining Degraded Performance when Reconstructing Exoplanets Atmospheric Parameters
(
Poster
)
>
link
Deep learning techniques have been widely adopted to automate the reconstruction of atmospheric parameters in exoplanets, at a fraction of the computational cost required by traditional approaches. However, many of the reconstruction models used are intrinsically non-interpretable. With this work, we aim to produce descriptions for the characteristics of exoplanets that make their atmospheric composition reconstruction problematic. We present a model-agnostic approach to detect biased data subgroups described via atmospheric parameters such as planet distance and surface gravity. We show that adopting an ensemble approach remarkably improves the quality of the outcomes overall, as well as at the subgroup level, on synthetic data simulated for the upcoming Ariel space mission. Experimental results further demonstrate the effectiveness of adopting explanation techniques in identifying and describing significant performance gaps between weak learners and their ensemble. Our work provides a more nuanced description of the results provided by deep learning techniques, to enable more meaningful assessments of what can be reasonably achieved with them. |
Alkis Koudounas · Flavio Giobergia · Elena Baralis 🔗 |
-
|
XBrainLab: An Open-Source Software for Explainable Artificial Intelligence-Based EEG Analysis
(
Poster
)
>
link
Recent advancements in explainable artificial intelligence have significantly accelerated scientific discoveries across various fields. In the realm of neuroscience research, the application of deep interpretation techniques has yielded valuable insights into brain functioning and mechanisms. We introduce XBrainLab, an accessible EEG analysis tool featuring a user-friendly graphical user interface (GUI) seamlessly compatible with code scripting. XBrainLab offers a comprehensive, end-to-end deep learning EEG analysis pipeline, capable of converting raw EEG signals into comprehensible visualizations of neural patterns. Through practical demonstrations using diverse EEG datasets, we highlight XBrainLab's versatility in interpreting neural representations in alignment with established neuroscience knowledge. This evolving open-source platform bridges cutting-edge computational techniques with the forefront of neuroscientific research. The code repository can be accessed at https://anonymous.4open.science/r/XBrainLab-21F8/README.md. |
Chia-ying Hsieh · Jing-Lun Chou · Yu-Hsin Chang · Chun-Shu Wei 🔗 |
-
|
Learning Inter-Graph Interactions Between Heterogeneous Substructures of Chemical Systems
(
Poster
)
>
link
Complex chemical systems containing heterogeneous substructures are common in real-world applications. Various physical phenomena of the complex chemical systems are derived from the interactions between the heterogeneous substructures. However, existing graph representation learning methods for inter-graph interactions assumed graph-level interactions between homogeneous structures, such as organic molecules and inorganic crystalline materials. We propose a data descriptor of the complex chemical systems and a graph neural network for learning inter-graph interactions between organic and inorganic compounds. We applied the proposed method to predict the physical properties of hybrid solar cell materials containing heterogeneous substructures, which have received significant attention for sustainable energy resources. By learning heterogeneous inter-graph interactions, the proposed method achieved state-of-the-art accuracy in predicting band gaps of 1,682 hybrid solar cell materials. |
Gyoung S. Na 🔗 |
-
|
Compositional Generative Inverse Design
(
Poster
)
>
link
Inverse design, where we seek input variables that optimize an underlying objective function, is an important problem arising in fields ranging from mechanical engineering to aerospace engineering. Inverse design is typically formulated as an optimization problem, with recent works leveraging optimization across learned dynamics models. However, as such models are optimized over, they tend to fall into adversarial modes, preventing effective sampling. We illustrate that by instead optimizing over the learned energy function captured by a diffusion model, we can avoid such adversarial examples and significantly improve design performance. We further illustrate how such a design system is compositional, enabling us to combine multiple diffusion models, each representing a subcomponent of the desired system, to design systems with every specified component. On an N-body interaction task and a challenging 2D multi-airfoil design task, we demonstrate that our method can design initial states and boundary shapes that are more complex than those in the training data. Our method outperforms the state-of-the-art neural inverse design method on the N-body dataset and discovers formation flying to minimize drag in the multi-airfoil design task. |
Tailin Wu · Takashi Maruyama · Long Wei · Tao Zhang · Yilun Du · Gianluca Iaccarino · Jure Leskovec 🔗 |
-
|
MoleCLUEs: Molecular Conformers Maximally In-Distribution for Predictive Models
(
Poster
)
>
link
Structure-based molecular ML (SBML) models can be highly sensitive to input geometries and give predictions with large variance. We present an approach that mitigates the challenge of selecting conformations for such models by generating conformers that explicitly minimize predictive uncertainty. To achieve this, we compute estimates of aleatoric and epistemic uncertainty that are differentiable w.r.t. latent posteriors. We then iteratively sample new latents in the direction of lower uncertainty by gradient descent. As we train our predictive models jointly with a conformer decoder, the new latent embeddings can be mapped to their corresponding inputs, which we call MoleCLUEs, or (molecular) counterfactual latent uncertainty explanations (Antorán et al., 2020). We assess our algorithm on the task of predicting drug-target binding from 3D structure with maximum confidence. We additionally analyze the structure trajectories obtained from conformer optimization, which provide insight into the sources of uncertainty in SBML. |
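The latent-descent step can be sketched as follows. The uncertainty surface and decoder below are illustrative stand-ins for the trained predictive model and conformer decoder, not the paper's components:

```python
# Toy sketch of the MoleCLUEs idea: move a latent embedding downhill on a
# differentiable uncertainty estimate, then decode it back to an input.
def uncertainty(z):
    # stand-in for total (aleatoric + epistemic) uncertainty, lowest at (0, 0)
    return z[0] ** 2 + 0.5 * z[1] ** 2

def decode(z):
    # stand-in for the conformer decoder: here just a labelled copy of z
    return {"conformer_coords": tuple(z)}

def moleclue(z, lr=0.2, steps=100, eps=1e-5):
    """Gradient descent on uncertainty in latent space (finite differences)."""
    z = list(z)
    for _ in range(steps):
        for i in range(len(z)):
            zp, zm = list(z), list(z)
            zp[i] += eps
            zm[i] -= eps
            g = (uncertainty(zp) - uncertainty(zm)) / (2 * eps)
            z[i] -= lr * g
    return decode(z)

clue = moleclue([2.0, -3.0])   # ends near the minimum-uncertainty latent
```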
Michael Maser · Nataša Tagasovska · Jae Hyeon Lee · Andrew Watkins 🔗 |
-
|
scBiGNN: Bilevel Graph Representation Learning for Cell Type Classification from Single-cell RNA Sequencing Data
(
Poster
)
>
link
Single-cell RNA sequencing (scRNA-seq) technology provides high-throughput gene expression data for studying the cellular heterogeneity and dynamics of complex organisms. Graph neural networks (GNNs) have been widely used for automatic cell type classification, a fundamental problem in scRNA-seq analysis. However, existing methods do not sufficiently exploit both gene-gene and cell-cell relationships, so the true potential of GNNs is not realized. In this work, we propose a bilevel graph representation learning method, named scBiGNN, to simultaneously mine the relationships at both the gene and cell levels for more accurate single-cell classification. Specifically, scBiGNN comprises two GNN modules for identifying cell types. A gene-level GNN adaptively learns gene-gene interactions and cell representations via the self-attention mechanism, and a cell-level GNN builds on the cell-cell graph constructed from the cell representations generated by the gene-level GNN. To tackle the scalability issue of processing a large number of cells, scBiGNN adopts an Expectation-Maximization (EM) framework in which the two modules are alternately trained via the E-step and M-step to learn from each other. Through this interaction, the gene- and cell-level structural information is integrated to gradually enhance the classification performance of both GNN modules. Experiments on benchmark datasets demonstrate that scBiGNN outperforms a variety of existing methods for cell type classification from scRNA-seq data. |
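The alternating E-step/M-step schedule, where each module trains on pseudo-labels produced by the other, can be sketched with two trivial nearest-centroid classifiers standing in for the gene-level and cell-level GNNs (all data and labels are made up for illustration):

```python
# Minimal sketch of the alternating EM-style training schedule: each module
# is fit on pseudo-labels produced by the other. The "modules" here are
# illustrative 1-nearest-centroid classifiers, not GNNs.
def fit_centroids(points, labels):
    cents = {}
    for p, y in zip(points, labels):
        cents.setdefault(y, []).append(p)
    return {y: sum(v) / len(v) for y, v in cents.items()}

def predict(cents, points):
    return [min(cents, key=lambda y: abs(p - cents[y])) for p in points]

def em_alternate(points, init_labels, rounds=5):
    labels_a = list(init_labels)
    for _ in range(rounds):
        cents_b = fit_centroids(points, labels_a)   # M-step: module B
        labels_b = predict(cents_b, points)          # E-step: pseudo-labels
        cents_a = fit_centroids(points, labels_b)    # M-step: module A
        labels_a = predict(cents_a, points)
    return labels_a

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
noisy = [0, 0, 1, 1, 1, 1]            # one mislabelled point
labels = em_alternate(pts, noisy)     # alternation corrects the mislabel
```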
Rui Yang · Wenrui Dai · Chenglin Li · Junni Zou · Dapeng Wu · Hongkai Xiong 🔗 |
-
|
Rethinking Bayesian Optimization with Gaussian Processes: Insights from Hyperspectral Trait Search
(
Poster
)
>
link
The application of Bayesian Optimization using Gaussian Processes (BO-GP) to global optimization problems is ubiquitous across scientific disciplines because, beyond good performance, it supports exact inference, interpretability, and straightforward uncertainty quantification. In this paper, we revisit the biological application of BO-GP to searching trait spaces for genomic prediction, which uses genome-wide marker information to predict breeding values for agronomically important traits. Genomic predictions help breeders select desirable plants earlier in the field season without waiting to observe traits later. While these search spaces are known to be sharp and aperiodic, BO-GP is considered a feasible approach. However, our work finds that a simple random search surprisingly achieves performance comparable to BO-GP while requiring significantly less computing cost. Through a careful investigation, we explain this observation as a fundamental limitation of BO-GP for sharp and aperiodic functions: the incompatible structure results in samples similar to random search but at higher computational cost. Our results highlight a blind spot in the current use of BO-GP for scientific applications, such as trait prediction, with sharp and aperiodic search spaces. |
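The random-search baseline at the heart of the comparison is a few lines of code. The sharp, aperiodic objective below is an illustrative stand-in for the genomic trait surface, not the paper's actual search space:

```python
import math
import random

# A sharp, aperiodic 1-D objective standing in for the trait-search
# landscape (illustrative only). Lower is better.
def sharp_objective(x):
    return -abs(math.sin(x * x)) / (1.0 + 0.1 * abs(x))

def random_search(fn, lo, hi, budget, seed=0):
    """Uniform random search: the paper's surprisingly strong baseline."""
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for _ in range(budget):
        x = rng.uniform(lo, hi)
        y = fn(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y

bx, by = random_search(sharp_objective, -10, 10, budget=2000)
```

On sharp, aperiodic surfaces a GP's smoothness assumptions add cost without adding guidance, which is why this baseline can match BO-GP.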
Ruhana Azam · Sanmi Koyejo · Samuel Fernandes · Mohammed Kebir · Andrew Leakey · Alexander Lipka 🔗 |
-
|
PreDiff: Precipitation Nowcasting with Latent Diffusion Models
(
Poster
)
>
link
Earth system forecasting has traditionally relied on complex physical models that are computationally expensive and require significant domain expertise. In the past decade, the unprecedented increase in spatiotemporal Earth observation data has enabled data-driven forecasting models using deep learning techniques. These models have shown promise for diverse Earth system forecasting tasks but either struggle with handling uncertainty or neglect domain-specific prior knowledge, resulting in averaging possible futures to blurred forecasts or generating physically implausible predictions. To address these limitations, we propose a two-stage pipeline for probabilistic spatiotemporal forecasting: 1) We develop PreDiff, a conditional latent diffusion model capable of probabilistic forecasts. 2) We incorporate an explicit knowledge control mechanism to align forecasts with domain-specific physical constraints. This is achieved by estimating the deviation from imposed constraints at each denoising step and adjusting the transition distribution accordingly. We conduct empirical studies on two datasets: N-body MNIST, a synthetic dataset with chaotic behavior, and SEVIR, a real-world precipitation nowcasting dataset. Specifically, we impose the law of conservation of energy in N-body MNIST and anticipated precipitation intensity in SEVIR. Experiments demonstrate the effectiveness of PreDiff in handling uncertainty, incorporating domain-specific prior knowledge, and generating forecasts that exhibit high operational utility. |
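The per-step knowledge control can be sketched with a toy conservation law. The mock denoising transition and the constraint-correction below are illustrative stand-ins for PreDiff's latent diffusion transitions and its learned deviation estimate:

```python
import random

# Sketch of knowledge control: after each (mock) denoising step, estimate
# the deviation from an imposed constraint and shift the sample back toward
# satisfying it. The toy constraint: state components must sum to a total.
def constrain(state, target_total, strength=1.0):
    deviation = sum(state) - target_total
    shift = strength * deviation / len(state)
    return [s - shift for s in state]

def mock_denoise(state, rng):
    # stand-in for one reverse-diffusion transition (adds model noise)
    return [s + rng.gauss(0, 0.1) for s in state]

def sample_with_control(n_steps, dim, target_total, seed=0):
    rng = random.Random(seed)
    state = [rng.gauss(0, 1) for _ in range(dim)]
    for _ in range(n_steps):
        state = mock_denoise(state, rng)
        state = constrain(state, target_total)  # knowledge control per step
    return state

x = sample_with_control(n_steps=50, dim=8, target_total=4.0)
```

With full correction strength the final sample satisfies the conservation constraint exactly, while the stochastic transitions still shape everything the constraint leaves free.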
Zhihan Gao · Xingjian Shi · Boran Han · Hao Wang · Xiaoyong Jin · Danielle Maddix · Yi Zhu · Mu Li · Yuyang (Bernie) Wang 🔗 |
-
|
Explaining Drug Repositioning: A Case-Based Reasoning Graph Neural Network Approach
(
Poster
)
>
link
Drug repositioning, the identification of novel uses for existing therapies, has become an attractive strategy to accelerate drug development. Recently, knowledge graphs (KGs) have emerged as a powerful representation of interconnected data within the biomedical domain. While biomedical KGs can be used to predict new connections between compounds and diseases, most approaches only state whether two nodes are related; they fail to explain why. In this project, we introduce an implementation of the semi-parametric Case-Based Reasoning over subgraphs approach (CBR-SUBG), designed to derive the underlying mechanism for a drug query by gathering graph patterns of similar entities. We show that our adaptation outperforms existing KG link prediction models on a drug repositioning task. Furthermore, our findings demonstrate that the CBR-SUBG strategy can not only rank potential repositioning candidates but also provide interpretable biological paths, leading to more informed decisions. |
Adriana Carolina Gonzalez Cavazos 🔗 |
-
|
AI Ethics Education for Scientists
(
Poster
)
>
link
Machine learning (ML) and artificial intelligence (AI) are becoming core components of scientific research across fields. While there are increasingly many formal and informal domain-specific learning opportunities for students and early-career scientists interested in AI/ML, AI ethics is often an overlooked part of this training. This is concerning, as knowledge of the ethical considerations around AI/ML is an essential component of training effective and responsible scientists. This work presents an introductory AI ethics curriculum tailored for scientists and describes implementations of the curriculum in various training scenarios. |
Savannah Thais 🔗 |
-
|
Harmonic Prior Self-conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design
(
Oral
)
>
link
A significant amount of protein function requires binding small molecules, including enzymatic catalysis. As such, designing binding pockets for small molecules has several impactful applications ranging from drug synthesis to energy storage. Towards this goal, we first develop HarmonicFlow, an improved generative process over 3D protein-ligand binding structures based on our self-conditioned flow matching objective. FlowSite extends this flow model to jointly generate a protein pocket's discrete residue types and the molecule's binding 3D structure. We show that HarmonicFlow improves upon the state-of-the-art generative processes for docking in simplicity, generality, and performance. Enabled by this structure model, FlowSite designs binding sites substantially better than baseline approaches and provides the first general solution for binding site design. |
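The flow matching objective underlying HarmonicFlow can be sketched in its standard conditional form: sample a time, interpolate between a prior sample and the data, and regress a velocity model toward the straight-line target velocity. The closed-form "model" below is the optimal answer for a single pair and stands in for the trained network; the self-conditioning and harmonic prior are omitted:

```python
import random

# Sketch of the (conditional) flow-matching objective: sample t ~ U[0,1],
# interpolate x_t = (1 - t) * x0 + t * x1, and regress a velocity model
# toward the target velocity x1 - x0.
def flow_matching_loss(velocity_model, x0, x1, n_samples=100, seed=0):
    rng = random.Random(seed)
    loss = 0.0
    for _ in range(n_samples):
        t = rng.random()
        xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
        target = [b - a for a, b in zip(x0, x1)]   # straight-line velocity
        pred = velocity_model(xt, t)
        loss += sum((p - v) ** 2 for p, v in zip(pred, target))
    return loss / n_samples

x0, x1 = [0.0, 0.0], [1.0, 2.0]        # prior sample and data point
perfect = lambda xt, t: [1.0, 2.0]     # knows the true straight-line flow
loss = flow_matching_loss(perfect, x0, x1)   # -> 0.0 for the optimal model
```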
Hannes Stärk · Bowen Jing · Regina Barzilay · Tommi Jaakkola 🔗 |
-
|
Modelling Microbial Communities with Graph Neural Networks
(
Poster
)
>
link
Understanding the interactions and interplay of microorganisms is a great challenge with many applications in medical and environmental settings. In this work, we model bacterial communities directly from their genomes using graph neural networks (GNNs). GNNs leverage the inductive bias induced by the set nature of bacteria, enforcing permutation invariance and granting combinatorial generalization. We propose to learn the dynamics implicitly by directly predicting community relative abundance profiles at steady state, thus escaping the need for growth curves. On two real-world datasets, we show for the first time generalization to unseen bacteria and different community structures. To investigate the prediction results more deeply, we create a simulation for flexible data generation and analyze the effects of bacteria interaction strength, community size, and training data amount. |
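The permutation invariance that the GNN exploits can be demonstrated with sum pooling over per-genome embeddings: the community representation does not depend on the order in which bacteria are listed. The embedding function is an illustrative stand-in:

```python
# Sketch of permutation invariance via sum pooling. A community embedding
# built by summing per-genome embeddings is order-independent, which is the
# inductive bias the paper's GNN enforces. embed_genome is a toy stand-in.
def embed_genome(g):
    return [sum(g), max(g)]            # toy per-genome features

def community_representation(genomes):
    embs = [embed_genome(g) for g in genomes]
    return [sum(e[i] for e in embs) for i in range(2)]   # sum pooling

community = [[1, 2], [3, 4], [5, 6]]
shuffled = [[5, 6], [1, 2], [3, 4]]
r1 = community_representation(community)
r2 = community_representation(shuffled)   # identical to r1
```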
Albane Ruaud · Cansu Sancaktar · Marco Bagatella · Christoph Ratzke · Georg Martius 🔗 |
-
|
Scalable Diffusion for Materials Generation
(
Poster
)
>
link
Generative models trained on internet-scale data are capable of generating novel and realistic texts, images, and videos. A natural next question is whether these models can advance science, for example by generating novel stable materials. Traditionally, models with explicit structures (e.g., graphs) have been used to model structural relationships in scientific data (e.g., atoms and bonds in crystals), but generating structures can be difficult to scale to large and complex systems. Another challenge in generating materials is the mismatch between standard generative modeling metrics and downstream applications. For instance, common metrics such as the reconstruction error do not correlate well with the downstream goal of discovering novel stable materials. In this work, we tackle the scalability challenge by developing a unified crystal representation that can represent any crystal structure (UniMat), followed by training a diffusion probabilistic model on these UniMat representations. Our empirical results suggest that despite the lack of explicit structure modeling, UniMat can generate high-fidelity crystal structures from larger and more complex chemical systems, outperforming previous graph-based approaches under various generative modeling metrics. To better connect the generation quality of materials to downstream applications, such as discovering novel stable materials, we propose additional metrics for evaluating generative models of materials, including per-composition formation energy and stability with respect to convex hulls through decomposition energy from Density Functional Theory (DFT). Lastly, we show that conditional generation with UniMat can scale to previously established crystal datasets with up to millions of crystal structures, outperforming random structure search (the current leading method for structure discovery) in discovering new stable materials. |
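The convex-hull stability check behind the proposed decomposition-energy metric can be sketched for a binary system: a compound is stable when its formation energy lies below the hull joining competing phases. The compositions and energies below are made up for illustration, not DFT values:

```python
# Toy energy-above-hull calculation for a binary compound A_x B_(1-x):
# interpolate the convex hull between neighbouring phases and compare.
# Positive result = unstable (would decompose); negative or zero = stable.
def energy_above_hull(x, e, hull_points):
    """hull_points: (composition, formation energy) pairs, sorted by x."""
    for (x0, e0), (x1, e1) in zip(hull_points, hull_points[1:]):
        if x0 <= x <= x1:
            # linear interpolation of the hull between neighbouring phases
            e_hull = e0 + (e1 - e0) * (x - x0) / (x1 - x0)
            return e - e_hull
    raise ValueError("composition outside hull range")

hull = [(0.0, 0.0), (0.5, -1.0), (1.0, 0.0)]     # endpoints + one stable phase
e_stable = energy_above_hull(0.25, -0.6, hull)   # below the hull segment
e_unstable = energy_above_hull(0.75, -0.2, hull) # above the hull segment
```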
Sherry Yang · KwangHwan Cho · Amil Merchant · Pieter Abbeel · Dale Schuurmans · Igor Mordatch · Ekin Dogus Cubuk 🔗 |
-
|
Towards out-of-distribution generalizable predictions of chemical kinetics properties
(
Poster
)
>
link
Machine Learning (ML) techniques have found applications in estimating chemical kinetics properties. With the accumulation of drug molecules identified through AI for drug discovery, the next imperative lies in AI-driven design of high-throughput chemical synthesis processes, which requires estimating the properties of unseen reactions involving unexplored molecules. To this end, existing ML approaches for kinetics property prediction need to be Out-Of-Distribution (OOD) generalizable. In this paper, we categorize OOD kinetic property prediction into three levels (structure, condition, and mechanism), revealing unique aspects of such problems. Under this framework, we create comprehensive datasets to benchmark (1) state-of-the-art ML approaches for reaction prediction in the OOD setting and (2) state-of-the-art graph OOD methods on kinetics property prediction problems. Our results demonstrate the challenges and opportunities in OOD kinetics property prediction. Our datasets and benchmarks can further support research in this direction. |
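A structure-level OOD split of the kind the benchmark needs can be sketched by holding out entire groups of reactions so that test structures never appear in training. The grouping key and records below are hypothetical, not the paper's datasets:

```python
# Sketch of a structure-level OOD split: hold out whole groups (e.g. all
# reactions sharing a molecular scaffold) so the test set contains only
# structures unseen during training. Records and keys are illustrative.
def ood_split(records, key, held_out_groups):
    train, test = [], []
    for r in records:
        (test if key(r) in held_out_groups else train).append(r)
    return train, test

reactions = [
    {"scaffold": "benzene", "rate": 1.2},
    {"scaffold": "benzene", "rate": 0.9},
    {"scaffold": "furan", "rate": 2.1},
    {"scaffold": "pyridine", "rate": 0.4},
]
train, test = ood_split(reactions, key=lambda r: r["scaffold"],
                        held_out_groups={"pyridine"})
```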
Zihao Wang · Yongqiang Chen · Yang Duan · Weijiang Li · Bo Han · James Cheng · Hanghang Tong 🔗 |
-
|
Single-cell Masked Autoencoder: An Accurate and Interpretable Automated Immunophenotyper
(
Poster
)
>
link
High-throughput single-cell cytometry data are crucial for understanding the immune system’s role in diseases and treatment response. However, the prevailing methods used for analyzing cytometry data, specifically manual gating and clustering methods, have certain limitations with scalability, robustness, and accuracy. In this study, we propose a single-cell masked autoencoder (scMAE), which offers an automated solution for immunophenotyping tasks such as cell type prediction. Our model aims to preserve the cell type definitions designed by the user, making interpretation and cross-study comparisons more accessible. The scMAE model follows a pre-train and fine-tune paradigm. During pre-training, scMAE utilizes Masked Single-cell Modelling (MScM) to learn relationships between protein markers in immune cells without the need for prior labeling information. Subsequently, the scMAE is fine-tuned on multiple specialized tasks, using a smaller designated portion of labeled data. Through evaluation experiments, we demonstrate that the pre-trained scMAE overcomes limitations of manual gating and clustering methods, providing accurate and interpretable cellular immunophenotyping. The introduction of scMAE represents a significant advancement in immunology research, enabling prediction and interpretation at the cellular level in immune disease. |
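The masked-modelling pre-training objective can be sketched as: hide a random subset of protein markers per cell and score the model only on its reconstruction of the hidden positions. The mean-imputing "model" below is a trivial stand-in for the autoencoder:

```python
import random

# Sketch of Masked Single-cell Modelling: mask a fraction of protein
# markers and compute the loss only on masked positions. The mean-imputer
# stands in for the autoencoder; marker values are illustrative.
def mask_markers(cell, mask_frac, rng):
    n_mask = max(1, int(len(cell) * mask_frac))
    idx = rng.sample(range(len(cell)), n_mask)
    masked = list(cell)
    for i in idx:
        masked[i] = 0.0          # masked-out marker value
    return masked, idx

def masked_mse(cell, reconstruction, masked_idx):
    # loss restricted to the hidden positions, as in masked autoencoding
    return sum((cell[i] - reconstruction[i]) ** 2
               for i in masked_idx) / len(masked_idx)

rng = random.Random(0)
cell = [1.0, 2.0, 3.0, 4.0]                    # one cell's marker vector
masked, idx = mask_markers(cell, mask_frac=0.5, rng=rng)
recon = [sum(cell) / len(cell)] * len(cell)    # mean-imputation baseline
loss = masked_mse(cell, recon, idx)
```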
Jaesik Kim · Matei Ionita · Matthew Lee · Michelle McKeague · Ajinkya Pattekar · Mark Painter · Joost Wagenaar · Van Q. Truong · Dylan Norton · Divij Mathew · Yonghyun Nam · Sokratis Apostolidis · Patryk Orzechowski · Sang-Hyuk Jung · Jakob Woerner · Yidi Huang · Nuala Meyer · Allison Greenplate · Dokyoon Kim · John Wherry 🔗 |
-
|
AI Framework for Generative Design of Computational Experiments with Structures in Physical Environment
(
Poster
)
>
link
We discuss the applicability of an open-source generative design framework for the automated design of computational experiments with structures in physical environments across various scientific fields. It may be used for experiments where the searched structure can be represented as a set of 2D non-oriented graphs of any topology (grids, polygons, trees) and the physical environment can be described by any numerical model (classical or data-driven). The proposed framework provides tools to efficiently explore a space of experiment configurations with generative AI models and evolutionary algorithms. Because the framework works with structures as graphs, a generative neural network can be pre-trained to create the initial population of optimized structures. We demonstrate results on examples from diverse fields, including the design of microfluidic devices, coastal engineering, heat transfer, acoustics, hydrodynamics, and medicine. |
Gleb Solovev · Anna Kalyuzhnaya · Alexander Hvatov · Nikita Starodubcev · Oleg Petrov · Nikolay Nikitin 🔗 |
-
|
Molecule-edit templates for efficient and accurate retrosynthesis prediction
(
Poster
)
>
link
Retrosynthesis involves determining a sequence of reactions to synthesize complex molecules from simpler precursors. As this poses a challenge in organic chemistry, machine learning has offered solutions, particularly for predicting possible reaction substrates for a given target molecule. These solutions mainly fall into template-based and template-free categories. The former is efficient but relies on a vast set of predefined reaction patterns, while the latter, though more flexible, can be computationally intensive and less interpretable. To address these issues, we introduce METRO (Molecule-Edit Templates for RetrOsynthesis), a machine-learning model that predicts reactions using minimal templates - simplified reaction patterns capturing only essential molecular changes - reducing computational overhead and achieving state-of-the-art results on standard benchmarks. |
Mikołaj Sacha · Michał Sadowski · Piotr Kozakowski · Ruard van Workum · Stanislaw Jastrzebski 🔗 |
-
|
Machine Learning for Blockchain
(
Poster
)
>
link
Blockchain, often heralded as the decentralized artificial intelligence and the foundational infrastructure of web3, promises to reshape industries by providing a secure, transparent, and decentralized way of recording transactions. However, despite its transparency and decentralization, blockchain faces limitations due to insufficient AI integration, impeding widespread adoption. This paper underscores the imperative of integrating machine learning (ML) to surmount these challenges, emphasizing blockchain's secure, decentralized foundation and its need for enhanced intelligence. The fusion of blockchain and ML is pivotal for overcoming constraints and unleashing the technology's full potential. This paper explores ML's empowering role in the infrastructure, application, and cross-chain dimensions, bolstering security, efficiency, and interoperability. This synergy addresses blockchain's limitations, broadening its applications and paving the way for its promise to be fully realized for the benefit of individuals and organizations. |
Luyao Zhang 🔗 |
-
|
Sample-efficient Antibody Design through Protein Language Model for Risk-aware Batch Bayesian Optimization
(
Poster
)
>
link
Antibody design is a time-consuming and expensive process that often requires extensive experimentation to identify the best candidates. To address this challenge, we propose an efficient and risk-aware antibody design framework that leverages protein language models (PLMs) and batch Bayesian optimization (BO). Our framework utilizes the generative power of PLMs to predict candidate sequences with higher naturalness and a Bayesian optimization algorithm to iteratively explore the sequence space and identify the most promising candidates. To further improve the efficiency of the search, we introduce a risk-aware approach that balances exploration and exploitation by incorporating uncertainty estimates into the acquisition function of the BO algorithm. We demonstrate the effectiveness of our approach through experiments on several benchmark datasets, showing that our framework outperforms state-of-the-art methods in terms of both efficiency and quality of the designed sequences. Our framework has the potential to accelerate the discovery of new antibodies and reduce the cost and time required for antibody design. |
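One standard way to make an acquisition function risk-aware, used here purely as an illustration of the idea (the paper's exact acquisition may differ), is to penalize predictive uncertainty rather than reward it, as in a lower-confidence bound:

```python
# Sketch of a risk-aware acquisition: penalize (rather than reward)
# predictive uncertainty so the search avoids high-risk sequences.
# kappa trades off predicted quality against risk; values are illustrative.
def risk_aware_acquisition(mean, std, kappa=1.0):
    return mean - kappa * std          # lower-confidence bound (maximize)

candidates = {
    "seq_A": (0.80, 0.05),   # (predicted affinity, model uncertainty)
    "seq_B": (0.95, 0.40),   # higher mean but much riskier
}
best = max(candidates, key=lambda s: risk_aware_acquisition(*candidates[s]))
# best -> "seq_A": the safer candidate wins despite the lower mean
```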
Yanzheng Wang · TIANYU SHI · Jie Fu 🔗 |
-
|
Easy to learn hard to master - how to solve an arbitrary equation with PINN
(
Poster
)
>
link
Physics-informed neural networks (PINNs) offer predictive capabilities for processes defined by known equations and limited data. While custom architectures and loss computations are often designed for each equation, the untapped potential of classical architectures remains unclear. A comprehensive study requires comparing the performance of a given neural network architecture and loss formulation across different types of equations. This paper introduces an open-source framework for the unified handling of ordinary differential equations (ODEs), partial differential equations (PDEs), and their systems. We explore PINN applicability and convergence comprehensively, demonstrating its performance across ODEs, PDEs, ODE systems, and PDE systems. |
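The PINN loss that such a framework evaluates can be sketched for a single ODE: the mean squared equation residual at collocation points plus the boundary-condition error. A real PINN differentiates the network output with autodiff; finite differences stand in here:

```python
import math

# Sketch of the PINN loss for dy/dx = y on [0, 1] with y(0) = 1:
# mean squared equation residual at collocation points + boundary term.
def pinn_loss(y, xs, eps=1e-5):
    residual = 0.0
    for x in xs:
        dydx = (y(x + eps) - y(x - eps)) / (2 * eps)  # finite-difference grad
        residual += (dydx - y(x)) ** 2                # equation residual
    return residual / len(xs) + (y(0.0) - 1.0) ** 2   # + boundary condition

xs = [i / 10 for i in range(11)]            # collocation points
loss_exact = pinn_loss(math.exp, xs)        # true solution: near-zero loss
loss_bad = pinn_loss(lambda x: 1 + x, xs)   # wrong candidate: large loss
```

Training a PINN amounts to minimizing this loss over network parameters; the same template extends to PDEs and systems by adding residual terms.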
Alexander Hvatov · Damir Aminev · Nikita Demyanchuk 🔗 |
-
|
Molecule Design by Latent Prompt Transformer
(
Poster
)
>
link
This paper proposes a latent prompt Transformer model for solving challenging optimization problems such as molecule design, where the goal is to find molecules with optimal values of a target chemical or biological property that can be computed by existing software. Our proposed model consists of three components. (1) A latent vector whose prior distribution is modeled by a UNet transformation of a Gaussian white-noise vector. (2) A molecule generation model that generates the string-based representation of a molecule conditioned on the latent vector in (1); we adopt a causal Transformer model that takes the latent vector in (1) as its prompt. (3) A property prediction model that predicts the value of the target property of a molecule via non-linear regression on the latent vector in (1). After initial training of the model on existing molecules and their property values, we gradually shift the model distribution towards the region that supports the desired values of the target property. Our experiments show that the proposed model achieves state-of-the-art performance on several benchmark molecule design tasks. |
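The "gradually shift the model distribution" loop can be illustrated with a cross-entropy-method-style stand-in: sample latents from the current prior, score them with the property predictor, and refit the prior toward the top scorers. The 1-D Gaussian prior and toy predictor below replace the paper's UNet prior and learned regression:

```python
import random

# Illustrative sketch of distribution shifting for design: sample latents,
# score them, and move the sampling distribution toward the best scorers.
# This is a CEM-style stand-in, not the paper's actual training procedure.
def property_predictor(z):
    return -(z - 3.0) ** 2          # toy target property, best at z = 3

def shift_distribution(mu=0.0, sigma=1.0, rounds=20, n=200, top=20, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        zs = [rng.gauss(mu, sigma) for _ in range(n)]
        elite = sorted(zs, key=property_predictor, reverse=True)[:top]
        mu = sum(elite) / len(elite)   # refit prior toward high-property region
    return mu

mu_final = shift_distribution()        # ends near the optimum z = 3
```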
Deqian Kong · Yuhao Huang · Jianwen Xie · Ying Nian Wu 🔗 |