MetaOmics-10T: The Foundational Dataset to Unlock Causal Modeling of Microbial Ecosystems
Abstract
We propose MetaOmics-10T, an openly shareable, foundational dataset for AI-accelerated discovery in microbial ecosystems. The dataset directly enables three high-impact AI tasks: (1) forecasting ecosystem dynamics, (2) predicting counterfactual outcomes of interventions, and (3) inverse design of microbial therapies under safety constraints. MetaOmics-10T combines two components: 10 trillion base pairs reclaimed from public archives using a Quality-Aware Tokenization (QA-Token) framework, and more than 100,000 interventional trajectories generated via model-guided experimental design. The result is a first-of-its-kind, probabilistic, intervention-ready corpus that addresses the principal bottleneck for causal modeling in microbiome science and provides an empirical testbed to assess the reach and limits of causal inference at scale.