Motif-aware Tokenization of the Genome: Towards Interpretable Modeling of Gene Regulation
Abstract
Genomic language models achieve strong performance on biological tasks, but they rely on tokenization methods that overlook the biological complexity of the genome. We introduce a biologically grounded tokenization strategy that partitions a DNA sequence into meaningful “words” based on transcription factor (TF) motifs. Embedding biological insight into vocabulary design preserves predictive power while potentially improving interpretability and computational efficiency. Proof-of-concept results show that motif-informed tokens yield representations that better capture the language of gene regulation. This opens the door to models that are both highly predictive and capable of decoding the regulatory grammar of the genome, which is vital for drug discovery and precision medicine.
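
To make the idea of motif-based “words” concrete, the following is a minimal sketch of one possible motif-aware tokenizer, not the method used in this work. It assumes a small hand-written dictionary of textbook consensus strings (`MOTIFS`, hypothetical and purely illustrative), greedy longest-match scanning from left to right, and a fallback to single-nucleotide tokens wherever no motif matches. Real TF motifs are usually represented as position weight matrices rather than exact strings, so exact matching here is a simplifying assumption.

```python
# Illustrative consensus strings only (assumption): a real vocabulary would be
# derived from a motif database and scored with position weight matrices.
MOTIFS = {
    "TATAAA": "TATA_box",
    "CACGTG": "E_box",
    "GGGCGG": "GC_box",
    "CCAAT": "CCAAT_box",
}


def motif_tokenize(seq, motifs=MOTIFS):
    """Greedy left-to-right tokenization: emit a motif token where a known
    motif starts, otherwise emit the single nucleotide at that position."""
    seq = seq.upper()
    # Try longer motifs first so overlapping candidates resolve to the longest.
    by_length = sorted(motifs, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(seq):
        for motif in by_length:
            if seq.startswith(motif, i):
                tokens.append(motifs[motif])
                i += len(motif)
                break
        else:
            # No motif begins here: fall back to a one-base token.
            tokens.append(seq[i])
            i += 1
    return tokens


example_tokens = motif_tokenize("GGTATAAACCACGTGA")
print(example_tokens)
# ['G', 'G', 'TATA_box', 'C', 'E_box', 'A']
```

In this toy example a 16-base sequence becomes 6 tokens, which illustrates how motif-level tokens could shorten the input while keeping each token tied to a named regulatory element.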