HyperHELM: Hyperbolic Hierarchy Encoding for mRNA Language Modeling
Abstract
Language models are increasingly applied to biological sequences such as proteins and mRNA, yet their default Euclidean geometry may be a poor match for the hierarchical structures inherent to biological data. While hyperbolic geometry offers a natural alternative for accommodating hierarchical data, it has not yet been applied to language modeling of mRNA sequences. In this work, we introduce HyperHELM, a framework for masked language model pre-training in hyperbolic space on mRNA sequences. Using a hybrid design that places hyperbolic layers atop a Euclidean backbone, HyperHELM aligns learned representations with the biological hierarchy defined by the relationship between mRNA and amino acids. Across multiple multi-species datasets, it outperforms Euclidean baselines on 8 of 9 property-prediction tasks, with a 10\% average improvement, and excels at out-of-distribution generalization to long, low-GC sequences; on antibody region annotation, it surpasses hierarchy-aware Euclidean models by 3\% in annotation accuracy. Our results highlight hyperbolic geometry as an effective inductive bias for language modeling of mRNA sequences.
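The abstract describes a hybrid design with hyperbolic layers on top of a Euclidean backbone. As a rough illustration of that idea only (not the authors' implementation), the sketch below lifts the output of a small Euclidean transformer onto the Poincaré ball via the exponential map at the origin and scores masked tokens by hyperbolic distance to learned class prototypes. The model dimensions, the curvature `c`, the prototype-based head, and all names (`HyperbolicMLMHead`, `HybridMLM`) are assumptions for illustration.

```python
# A minimal sketch of "hyperbolic layers atop a Euclidean backbone",
# assuming a Poincare-ball formulation; details are illustrative, not from the paper.
import torch
import torch.nn as nn

def expmap0(v, c=1.0, eps=1e-6):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def mobius_add(x, y, c=1.0, eps=1e-6):
    """Mobius addition x (+)_c y on the Poincare ball."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(eps)

def poincare_dist(x, y, c=1.0, eps=1e-6):
    """Geodesic distance d_c(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)."""
    sqrt_c = c ** 0.5
    norm = mobius_add(-x, y, c).norm(dim=-1).clamp(max=(1 - eps) / sqrt_c)
    return (2.0 / sqrt_c) * torch.atanh(sqrt_c * norm)

class HyperbolicMLMHead(nn.Module):
    """Scores each position by negative hyperbolic distance to per-token prototypes."""
    def __init__(self, dim, vocab_size, c=1.0):
        super().__init__()
        self.c = c
        self.prototypes = nn.Parameter(0.01 * torch.randn(vocab_size, dim))

    def forward(self, h):                            # h: (batch, seq, dim), Euclidean
        z = expmap0(h, self.c)                       # lift features onto the ball
        p = expmap0(self.prototypes, self.c)         # prototypes on the ball
        d = poincare_dist(z.unsqueeze(-2), p, self.c)  # (batch, seq, vocab)
        return -d                                    # closer prototype -> higher logit

class HybridMLM(nn.Module):
    """Euclidean transformer encoder followed by a hyperbolic output head."""
    def __init__(self, vocab_size=68, dim=128, c=1.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = HyperbolicMLMHead(dim, vocab_size, c)

    def forward(self, tokens):                       # tokens: (batch, seq) token ids
        return self.head(self.encoder(self.embed(tokens)))

# Usage: masked language modeling loss over a toy batch of token ids.
model = HybridMLM()
tokens = torch.randint(0, 68, (2, 32))
logits = model(tokens)                               # (2, 32, 68)
loss = nn.functional.cross_entropy(logits.reshape(-1, 68), tokens.reshape(-1))
```

Distance-to-prototype logits are one common way to realize a classification layer in hyperbolic space; the paper's actual hyperbolic layers may differ.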