Closing the Omics Gap: A Benchmark for Unified Modeling for Biomolecular Foundation Models
Abstract
In recent years, biomolecular foundation models (bioFMs) trained on massive amounts of omics data have demonstrated remarkable predictive capabilities across broad applications in biotechnology. However, the majority of existing bioFMs are unimodal, trained exclusively on DNA or protein sequences, and are evaluated on tasks specific to their sequence type. With few benchmarks designed around multi-omics data, opportunities for cross-modal evaluation are limited. Given recent evidence that multimodal bioFMs achieve more robust sequence understanding, there is an open need for additional benchmarks linking omics types. In response, we propose a novel cross-modal benchmark linking genomic and proteomic information to biological outcomes. The benchmark will prioritize translatability and cross-modal relevance; each example will list relevant nucleotide and amino acid sequences, provide a mapping between the two modalities, and be labeled with a functional attribute of the encoded protein(s). Given a pair of genes, our benchmark will pose questions such as: Do the encoded proteins co-localize or share similar functions? Are the genes associated with a common disease or targeted by a common drug? As an initial step toward this vision, we introduce Geno-Prot, a genomic–proteomic benchmark linking human sequences to nine functional attributes. With Geno-Prot, we show that simple cross-modal ensembles of bioFMs consistently outperform unimodal ensembles, motivating further development of multimodal benchmarks for new species and prediction tasks.
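To make the example format described above more concrete, the following is a minimal Python sketch of how one gene-pair record might be represented. Every name in it (`GeneRecord`, `PairExample`, the field names, the `codon_to_residue` helper, and the toy sequences) is an illustrative assumption and not the Geno-Prot schema; the mapping between modalities is simplified to a codon-to-residue alignment over a stop-free coding sequence.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """One gene with both modalities (hypothetical layout)."""
    gene_id: str
    cds: str      # coding nucleotide sequence, in frame, stop codon removed
    protein: str  # amino acid sequence of the encoded protein

    def codon_to_residue(self) -> list[tuple[str, str]]:
        """Pair each codon with the residue at the same position.

        Assumes the CDS is exactly three times the protein length;
        raises ValueError otherwise.
        """
        if len(self.cds) != 3 * len(self.protein):
            raise ValueError("CDS and protein lengths are inconsistent")
        return [
            (self.cds[3 * i : 3 * i + 3], residue)
            for i, residue in enumerate(self.protein)
        ]


@dataclass(frozen=True)
class PairExample:
    """A gene pair labeled with one functional attribute (hypothetical)."""
    gene_a: GeneRecord
    gene_b: GeneRecord
    task: str    # e.g. "co-localization" or "shared disease association"
    label: bool  # answer to the yes/no question posed by the task


# Toy sequences for illustration only; they are not real genes.
toy_a = GeneRecord("toy_gene_a", cds="ATGGCC", protein="MA")
toy_b = GeneRecord("toy_gene_b", cds="ATGAAA", protein="MK")
example = PairExample(toy_a, toy_b, task="co-localization", label=True)
```

In this sketch each yes/no question about a gene pair becomes one `PairExample`, so a unimodal model would read only `cds` or only `protein`, while a cross-modal model could use both fields together with the alignment returned by `codon_to_residue`.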