Collaborative Feature and Persona Enhancement for Trustworthy Medical Foundation Models
Abstract
Foundation models promise to democratize access to high-quality medical image segmentation, but their performance can still vary across patient subgroups. We explore whether conditioning a segmentation backbone on simple demographic metadata can improve worst-group segmentation without hurting overall accuracy or incurring significant computational overhead. We propose CEIGM-UNet, a compact U-shaped backbone that interleaves collaborative feature enhancement layers (CFEL) with Group Mamba blocks and integrates a Conditional Feature Recalibration (CFR) module, which maps a low-dimensional metadata vector to FiLM-like channel-wise scale and shift parameters. Because public benchmarks rarely provide reliable age or sex labels, we simulate demographic variability on Synapse by grouping cases into small, medium, and large organ-volume subgroups, and we evaluate fairness using equal opportunity differences and generalized Dice disparities across these volume-defined cohorts. On Synapse and ACDC, the metadata-conditioned CEIGM-UNet achieves Dice and HD95 scores competitive with recent CNN-, Transformer-, and Mamba-based U-Nets, while modestly narrowing the performance gaps between volume-based subgroups at negligible computational overhead. We discuss the limitations of this proxy setup and outline how such metadata conditioning could be integrated into larger medical foundation models in a principled, fairness-aware way.
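
To make the FiLM-like recalibration concrete, the following is a minimal sketch of how a metadata vector could be mapped to channel-wise scale and shift parameters and applied to a feature map. It is an illustration under stated assumptions, not the CFR implementation used in CEIGM-UNet: the helper names (`make_linear`, `linear`, `film_recalibrate`), the one-hot subgroup encoding, the tensor sizes, and the choice to parameterize the scale as `1 + delta` (so that untrained weights leave features unchanged) are all hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Hypothetical linear layer: small random weights and a zero bias."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(layer, x):
    """Apply a (weights, bias) layer to a vector given as a list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def film_recalibrate(features, metadata, scale_layer, shift_layer):
    """FiLM-style channel-wise affine modulation.

    features: nested lists of shape [C][H][W]
    metadata: list of length M (assumed here to be a one-hot subgroup code)
    Returns a new [C][H][W] feature map.
    """
    delta = linear(scale_layer, metadata)  # per-channel scale offset
    beta = linear(shift_layer, metadata)   # per-channel shift
    # Assumption: scale = 1 + delta, so zero delta and beta give the identity map.
    return [
        [[(1.0 + d) * v + b for v in row] for row in channel]
        for channel, d, b in zip(features, delta, beta)
    ]


rng = random.Random(0)
C, H, W, M = 4, 2, 3, 3  # illustrative sizes only

features = [[[rng.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
metadata = [0.0, 1.0, 0.0]  # assumed one-hot code: (small, medium, large) volume

scale_layer = make_linear(M, C, rng)
shift_layer = make_linear(M, C, rng)

out = film_recalibrate(features, metadata, scale_layer, shift_layer)
```

Because the modulation adds only two small linear maps from the metadata vector to the channel dimension, its cost is independent of the spatial resolution of the feature map, which is consistent with the low-overhead motivation stated above.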