Patient-level prediction from single-cell data using attention-based multiple instance learning with regulatory priors
Abstract
Single-cell RNA sequencing (scRNA-seq) enables high-resolution characterization of heterogeneous cellular populations, but predictive modeling remains fundamentally limited in clinical settings where outcomes are defined at the sample level. This problem is especially acute in settings such as chimeric antigen receptor (CAR) T cell therapy, where infused cellular products vary dramatically across patients and lie outside the training distributions of existing single-cell foundation models. Compounding this, strong batch effects across cohorts obscure true biological signals and hinder generalization. We introduce tcellMIL, a biologically informed multiple instance learning (MIL) framework that models each patient sample as a bag of unlabeled cells to predict therapeutic response. tcellMIL incorporates prior biological knowledge through SCENIC, a gene regulatory network inference method that uses known transcription factor binding motifs to compute regulon activity scores. These biologically grounded features reduce dimensionality and mitigate batch effects. They are denoised via a self-supervised autoencoder and combined with explicit batch encoding to improve cross-cohort generalization. An attention-based MIL mechanism identifies the most outcome-relevant subpopulations, providing interpretability at both the cell and regulon levels. Applied to 64 CD19-directed CAR T cell infusion products, tcellMIL outperforms pseudobulk and standard MIL baselines and identifies regulatory programs, such as TBX21, that drive therapeutic outcomes. Our results highlight a generalizable path for outcome prediction from scRNA-seq data where labels exist only at the sample level and cellular distributions deviate from standard atlases. Code: https://github.com/zinagoodlab/tcellMIL
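
To make the bag-of-cells idea concrete, the following is a minimal NumPy sketch of generic attention-based MIL pooling, not the tcellMIL implementation. It assumes each cell is already represented by a vector of regulon activity scores. The parameter names (V, w, beta), the dimensions, and the random inputs are hypothetical and chosen only for illustration. Each cell receives a softmax-normalized attention weight, the weighted average forms a sample-level embedding, and a logistic head maps that embedding to a response probability.

```python
import numpy as np

def attention_mil_pool(cells, V, w):
    """Pool a bag of per-cell feature vectors into one bag embedding.

    cells: (n_cells, n_features) array, e.g. regulon activity scores per cell.
    V:     (n_features, n_hidden) projection matrix (hypothetical parameter).
    w:     (n_hidden,) attention vector (hypothetical parameter).
    Returns (bag_embedding, attention_weights).
    """
    scores = np.tanh(cells @ V) @ w           # one unnormalized score per cell
    scores = scores - scores.max()            # stabilize the softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    bag_embedding = weights @ cells           # attention-weighted average of cells
    return bag_embedding, weights

def predict_response(cells, V, w, beta, bias):
    """Map a bag of cells to a response probability via a logistic head."""
    bag_embedding, weights = attention_mil_pool(cells, V, w)
    logit = bag_embedding @ beta + bias
    return 1.0 / (1.0 + np.exp(-logit)), weights

# Illustrative random data and untrained parameters (assumptions, not real values).
rng = np.random.default_rng(0)
n_cells, n_regulons, n_hidden = 500, 20, 8
cells = rng.normal(size=(n_cells, n_regulons))
V = rng.normal(size=(n_regulons, n_hidden))
w = rng.normal(size=n_hidden)
beta = rng.normal(size=n_regulons)

prob, weights = predict_response(cells, V, w, beta, 0.0)
top_cells = np.argsort(weights)[::-1][:10]    # indices of the most-attended cells
```

In a trained model of this kind, the attention weights give one score per cell, so ranking cells by weight (as in `top_cells`) is one way such a model can point to the subpopulations that most influence the sample-level prediction.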