
Workshop: I Can’t Believe It’s Not Better: Understanding Deep Learning Through Empirical Falsification

Pitfalls of conditional computation for multi-modal learning

Ivaxi Sheth · Mohammad Havaei · Samira Ebrahimi Kahou


Humans have perfected the art of learning from multiple modalities through their sensory organs. Despite impressive predictive performance on a single modality, neural networks cannot reach human-level accuracy when learning from multiple modalities. Multi-modal learning is particularly challenging because the modalities differ in structure. A popular method, Conditional Batch Normalization (CBN), was proposed to learn contextual features to aid a deep learning task. CBN uses auxiliary data to improve representational power by learning an affine transformation for Convolutional Neural Networks. Despite the boost in performance from using a CBN layer, our work reveals that the visual features learned when auxiliary data is introduced via CBN deteriorate. We perform comprehensive experiments to evaluate how brittle CBN is across datasets. We show that CBN is sensitive to the dataset, suggesting that learning from visual features could often be superior for generalization. We perform exhaustive experiments on natural images for bird classification and histology images for cancer type classification. We observe that the CBN network learns close to no visual features on the bird classification dataset and only partial visual features on the histology dataset. Our experiments reveal that CBN may encourage shortcut learning between the auxiliary data and the labels.
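To make the affine transformation described in the abstract concrete, the following is a minimal sketch of one common formulation of conditional batch normalization, not the authors' implementation. It assumes that the scale and shift are predicted from the auxiliary vector by simple linear maps, and it omits spatial dimensions so that each sample is just a list of channel activations. The names `conditional_batch_norm`, `w_gamma`, and `w_beta` are hypothetical and chosen only for illustration.

```python
import math


def conditional_batch_norm(x, aux, w_gamma, w_beta, eps=1e-5):
    """Sketch of conditional batch normalization (assumed formulation).

    x       : N samples, each a list of C channel activations
              (spatial dimensions are omitted for brevity).
    aux     : N auxiliary vectors, each of length A.
    w_gamma : A x C weights predicting a per-sample offset to the scale.
    w_beta  : A x C weights predicting a per-sample shift.
    """
    n, c = len(x), len(x[0])

    # Per-channel batch statistics, as in ordinary batch normalization.
    mean = [sum(row[j] for row in x) / n for j in range(c)]
    var = [sum((row[j] - mean[j]) ** 2 for row in x) / n for j in range(c)]

    out = []
    for row, a in zip(x, aux):
        # Assumption: gamma = 1 + linear(aux), beta = linear(aux).
        gamma = [1.0 + sum(a[k] * w_gamma[k][j] for k in range(len(a)))
                 for j in range(c)]
        beta = [sum(a[k] * w_beta[k][j] for k in range(len(a)))
                for j in range(c)]
        # Normalize each channel, then apply the aux-conditioned affine map.
        out.append([
            gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
            for j in range(c)
        ])
    return out


# Two samples, two channels, one-dimensional auxiliary input.
features = [[1.0, 2.0], [3.0, 6.0]]
auxiliary = [[1.0], [0.0]]
shifted = conditional_batch_norm(
    features, auxiliary, w_gamma=[[0.0, 0.0]], w_beta=[[0.5, 0.5]]
)
```

In this sketch the shift term depends only on the auxiliary vector, so the auxiliary data can change the layer output regardless of the visual activations. That is one plausible route by which a shortcut between auxiliary data and labels could form, although the abstract itself does not specify the mechanism.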
