OpenMETAGENE: Large-Scale, Diverse, and Open Data Recipes for Multimodal Metagenomics Models
Shangshang Wang · Ollie Liu · Jiarui Zhang · Willie Neiswanger
Abstract
This proposal outlines the creation of OpenMETAGENE, a large-scale, multimodal dataset designed to bridge the gap between raw metagenomic sequences and human-interpretable biological insights. Whereas many genomic models focus on long, contiguous gene sequences from single organisms (e.g., the human reference genome), our work centers on pairing diverse short-read metagenomic data with the rich annotations needed to train a new generation of foundation models capable of complex reasoning for biosurveillance and personalized health.