From Base Pairs to Functions: Rich RNA Representations via Multimodal Language Modeling
Abstract
RNA foundation models have recently emerged as powerful tools for learning from large sequence databases, yet their embeddings often fall short in simple probing setups, necessitating additional finetuning. Most existing models are pretrained solely on sequences, on the assumption that structural information will emerge implicitly. We introduce RABONA, a multimodal RNA language model jointly pretrained on sequence-structure pairs with modality-specific masking and designed for both generative and understanding tasks. Compared with other RNA language models, RABONA produces embeddings that form clearer family-specific clusters and exhibits stronger attention alignment with RNA base pairs. In this paper, we focus on RABONA's predictive capabilities and show that it consistently outperforms larger baselines across diverse downstream tasks in both finetuning and linear probing setups, demonstrating that incorporating structure during pretraining yields richer RNA embeddings and enables more efficient foundation models.
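To make the modality-specific masking idea concrete, the minimal sketch below independently masks tokens in an aligned sequence/dot-bracket pair before they would be fed to a joint encoder. The mask tokens, masking rates, and function names here are illustrative assumptions, not RABONA's actual implementation.

```python
# Illustrative sketch of modality-specific masking over paired RNA inputs.
# All names, special tokens, and rates are assumptions for illustration only.
import random

SEQ_MASK = "<mask_seq>"        # hypothetical mask token for the sequence modality
STRUCT_MASK = "<mask_struct>"  # hypothetical mask token for the structure modality


def mask_pair(sequence, structure, p_seq=0.15, p_struct=0.15, seed=None):
    """Independently mask each modality of an aligned (sequence, dot-bracket) pair."""
    rng = random.Random(seed)
    masked_seq = [SEQ_MASK if rng.random() < p_seq else nt for nt in sequence]
    masked_struct = [STRUCT_MASK if rng.random() < p_struct else s for s in structure]
    return masked_seq, masked_struct


# Toy hairpin example: nucleotides paired position-by-position with dot-bracket structure.
seq = list("GGGAAACCC")
db = list("(((...)))")
print(mask_pair(seq, db, seed=0))
```

Because the two modalities are masked with separate tokens and rates, a model trained to reconstruct one modality can condition on the unmasked context of the other, which is one plausible way joint sequence-structure pretraining could be set up.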